Re: Spark on Oracle available as an Apache licensed open source repo

2022-01-14 Thread Harish Butani
Look at the pushdown plans for all the TPCDS queries here 
<https://github.com/oracle/spark-oracle/wiki/TPCDS-Queries>
We push Joins, Aggregates, Windowing, etc.; as I said, we can do complete pushdown 
of 95 of the 99 TPCDS queries.
The generic JDBC datasource pushes only single-table scans, filters and partial 
aggregates. In that case a lot of data is moved from the Oracle instance to 
Spark during query execution.
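
For comparison, here is a minimal sketch of the generic JDBC path (standard Spark API; the connection details and table names are placeholders, and a SparkSession `spark` is assumed). Only column pruning, filters and partial aggregates are pushed this way, so the join below runs in Spark after rows from both tables have moved over JDBC:

// Standard Spark JDBC reads; URL, credentials and table names are placeholders.
val jdbcOpts = Map(
  "url"      -> "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1",
  "user"     -> "scott",
  "password" -> "tiger")

val orders    = spark.read.format("jdbc").options(jdbcOpts)
                     .option("dbtable", "SALES.ORDERS").load()
val lineitems = spark.read.format("jdbc").options(jdbcOpts)
                     .option("dbtable", "SALES.LINEITEMS").load()

// The join and final aggregate are evaluated in Spark, not in Oracle.
orders.join(lineitems, "order_id")
      .groupBy("order_status")
      .count()
      .show()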

Beyond this, the SQL Macro 
<https://github.com/oracle/spark-oracle/wiki/Spark_SQL_macros> feature can 
translate certain kinds of UDFs into Oracle expressions. This again avoids a lot 
of data movement: instead of the UDF executing in Spark, an equivalent Oracle 
expression is evaluated in Oracle.

This works with on-premises Oracle; currently tested on 19c.

regards,
Harish.

> On Jan 14, 2022, at 2:51 AM, Mich Talebzadeh  
> wrote:
> 
> Hello,
> 
> Thanks for this info.
> 
> Have you tested this feature on on-premises Oracle, say 11c or 12c, besides ADW 
> in the Cloud?
> 
> I can see the transactional feature being useful in terms of commit/rollback to 
> Oracle, but I cannot figure out the performance gains from your blog etc.
> 
> My concern is that we currently connect to Oracle, as well as many other JDBC 
> compliant databases, through Spark's generic JDBC connections 
> <https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html> with the 
> same look and feel. Unless there is an overriding reason, I don't see why 
> there is a need to switch to this feature.
> 
> 
> Cheers
> 
> View my LinkedIn profile 
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Fri, 14 Jan 2022 at 00:50, Harish Butani <mailto:rhbutani.sp...@gmail.com> wrote:
> Spark on Oracle is now available as an open source Apache licensed github 
> repo <https://github.com/oracle/spark-oracle>. Build and deploy it as an 
> extension jar in your Spark clusters.
> 
> Use it to combine Apache Spark programs with data in your existing Oracle 
> databases without expensive data copying or query time data movement. 
> 
> The core capability is Optimizer extensions that collapse SQL operator 
> sub-graphs to an OraScan that executes equivalent SQL in Oracle. Physical 
> plan parallelism 
> <https://github.com/oracle/spark-oracle/wiki/Query-Splitting> can be 
> controlled to split Spark tasks to operate on Oracle data block ranges, 
> result set pages, or table partitions.
> 
> We push down large parts of Spark SQL to Oracle; for example, 95 of the 99 TPCDS 
> queries are completely pushed to Oracle. 
> <https://github.com/oracle/spark-oracle/wiki/TPCDS-Queries>
> 
> With Spark SQL macros 
> <https://github.com/oracle/spark-oracle/wiki/Spark_SQL_macros> you can write 
> custom Spark UDFs that get translated and pushed as Oracle SQL expressions. 
> 
> With DML pushdown <https://github.com/oracle/spark-oracle/wiki/DML-Support>, 
> inserts in Spark SQL get pushed as transactionally consistent inserts/updates 
> on Oracle tables.
> 
> See the Quick Start Guide 
> <https://github.com/oracle/spark-oracle/wiki/Quick-Start-Guide> on how to 
> set up an Oracle free-tier ADW instance, load it with TPCDS data, and try out 
> the Spark on Oracle Demo <https://github.com/oracle/spark-oracle/wiki/Demo> 
> on your Spark cluster. 
> 
> More details can be found in our blog 
> <https://hbutani.github.io/blogs/blog/Spark_on_Oracle_Blog.html> and the 
> project wiki <https://github.com/oracle/spark-oracle/wiki>.
> 
> regards,
> Harish Butani



Spark on Oracle available as an Apache licensed open source repo

2022-01-13 Thread Harish Butani
Spark on Oracle is now available as an open source Apache licensed github
repo <https://github.com/oracle/spark-oracle>. Build and deploy it as an
extension jar in your Spark clusters.
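
As a hedged sketch of what that wiring looks like (the extension class, catalog class and catalog option names below are placeholders, not the project's actual values; the Quick Start Guide documents the real configuration):

import org.apache.spark.sql.SparkSession

// Placeholder class names and options -- consult the Quick Start Guide for
// the real extension class and catalog configuration keys/values.
val spark = SparkSession.builder()
  .appName("spark-on-oracle")
  .config("spark.sql.extensions", "<spark-oracle extensions class>")
  .config("spark.sql.catalog.oracle", "<spark-oracle catalog class>")
  .config("spark.sql.catalog.oracle.url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
  .getOrCreate()

// Once the catalog is registered, Oracle tables are addressable by name:
spark.sql("select count(*) from oracle.sales.orders").show()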

Use it to combine Apache Spark programs with data in your existing Oracle
databases without expensive data copying or query time data movement.

The core capability is Optimizer extensions that collapse SQL operator
sub-graphs to an OraScan that executes equivalent SQL in Oracle. Physical
plan parallelism
<https://github.com/oracle/spark-oracle/wiki/Query-Splitting> can be
controlled to split Spark tasks to operate on Oracle data block ranges,
result set pages, or table partitions.

We push down large parts of Spark SQL to Oracle; for example, 95 of the 99 TPCDS
queries are completely pushed to Oracle.
<https://github.com/oracle/spark-oracle/wiki/TPCDS-Queries>

With Spark SQL macros
<https://github.com/oracle/spark-oracle/wiki/Spark_SQL_macros> you can
write custom Spark UDFs that get translated and pushed as Oracle SQL
expressions.

With DML pushdown <https://github.com/oracle/spark-oracle/wiki/DML-Support>,
inserts in Spark SQL get pushed as transactionally consistent
inserts/updates on Oracle tables.
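
As a hedged illustration (the catalog, schema and table names here are hypothetical, and a configured SparkSession `spark` is assumed), this is the shape of statement that gets pushed:

// Hypothetical catalog/schema/table names; an INSERT like this executes as a
// transactionally consistent insert on the underlying Oracle table.
spark.sql("""
  insert into oracle.sales.returns
  select order_id, item_id, return_reason
  from   staged_returns
  where  return_qty > 0
""")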

See the Quick Start Guide
<https://github.com/oracle/spark-oracle/wiki/Quick-Start-Guide> on how to
set up an Oracle free-tier ADW instance, load it with TPCDS data, and try
out the Spark on Oracle Demo
<https://github.com/oracle/spark-oracle/wiki/Demo> on your Spark cluster.

More details can be found in our blog
<https://hbutani.github.io/blogs/blog/Spark_on_Oracle_Blog.html> and the
project wiki <https://github.com/oracle/spark-oracle/wiki>.

regards,
Harish Butani


Spark on your Oracle Data Warehouse

2021-03-23 Thread Harish Butani
I have been developing 'Spark on Oracle', a project to provide better
integration of Spark into an Oracle Data Warehouse. You can read about it
at https://hbutani.github.io/spark-on-oracle/blog/Spark_on_Oracle_Blog.html

The key features are Catalog Integration, translation and pushdown of Spark
SQL to Oracle SQL/PL-SQL, Language Integration and Runtime Integration.

These are provided as Spark extensions via a Catalog Plugin, v2 DataSource,
Logical and Physical Planner Rules, Parser Extension, automatic Function
Registration and Spark SQL Macros (a generic Spark capability we have
developed).
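
For context, these pieces plug in through Spark's SparkSessionExtensions hooks. Below is a minimal, generic sketch of that mechanism; the class and rule names are made up for illustration and are not this project's classes:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing optimizer rule, standing in for a real pushdown rewrite.
case class ExamplePushdownRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Extensions are a function over SparkSessionExtensions; a real project would
// inject parser extensions, optimizer rules, planner strategies and functions.
class ExampleExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit = {
    ext.injectOptimizerRule(session => ExamplePushdownRule(session))
  }
}

// Enabled with: --conf spark.sql.extensions=<fully qualified ExampleExtensions>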

The vision is to enable Oracle customers to deploy Spark Applications that
take full advantage of the data and capabilities of their Oracle Data
Warehouse; and also make Spark cluster operations simpler and unified with
their existing Oracle warehouse operations.

Looking for suggestions and comments from the Spark community.

regards,
Harish Butani.


Spark SQL Macros

2021-02-19 Thread Harish Butani
Hi,

I have been working on Spark SQL Macros
https://github.com/hbutani/spark-sql-macros

Spark SQL Macros provide a capability to register custom functions into a
Spark Session that is similar to UDF registration. The difference is that
the SQL Macros registration mechanism attempts to translate the function
body to an equivalent Spark catalyst Expression with holes (MacroArg
catalyst expressions). Under the covers, SQLMacro is a set of Scala blackbox
macros that attempt to rewrite the Scala function body AST into an
equivalent catalyst Expression.

The generated expression is encapsulated in a FunctionBuilder that is
registered in Spark's FunctionRegistry. Then any function invocation is
replaced by the equivalent catalyst Expression with the holes replaced by
the calling site arguments.
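
A hedged sketch of the contrast with plain UDF registration (assuming a SparkSession `spark`; the macro registration call is shown only as a commented illustration, since the exact API is documented in the project README):

// A plain UDF: the Scala closure stays opaque to catalyst and is invoked
// row-by-row with SerDe at the function boundary.
spark.udf.register("add_one_udf", (i: Int) => i + 1)

// With a SQL Macro, the same body is translated at registration time into a
// catalyst Expression with a hole for the argument, so `add_one_macro(c)` in
// SQL simply becomes the expression `c + 1`. (The call below is illustrative
// only -- see the README for the actual registration API.)
// spark.registerMacro("add_one_macro", (i: Int) => i + 1)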

There are two potential performance benefits to replacing function calls
with native catalyst expressions:
- Evaluation performance, since we avoid the SerDe cost at the function
boundary.
- More importantly, since the plan contains native catalyst expressions, more
optimizations are possible.
  - For example, see the taxRate example, where the discount calculation is
eliminated.
  - Pushdown of operations to DataSources has a huge impact. For example, see
the Oracle SQL generated and pushed when a macro is defined instead of a
UDF.

So this has potential benefits for developers of custom functions and
DataSources. I am looking for feedback from the community on potential use
cases and features to develop.

To use this functionality:
- Build the jar by issuing sbt sql/assembly, or download it from the releases
page.
- The jar can be added to any Spark environment.
- We have developed against a spark-3.1.0 dependency.

The README provides more details and examples.
This page (
https://github.com/hbutani/spark-sql-macros/wiki/Spark_SQL_Macro_examples)
provides even more examples.

regards,
Harish Butani.


Re: Design patterns involving Spark

2017-04-12 Thread Harish Butani
BTW, we now support OLAP functionality natively in Spark, without the need for
Druid, through our Spark-native BI platform (SNAP):
https://www.linkedin.com/pulse/integrated-business-intelligence-big-data-stacks-harish-butani

- We provide SQL commands to create a star schema, create an OLAP index, and
insert into an OLAP index, so you can be up and running very quickly in a
Spark environment.
- Query acceleration is provided through an OLAP Index FileFormat and Query
Optimizer extensions (just like spark-druid-olap).
- We have also posted details on a BI Benchmark
<https://www.linkedin.com/pulse/integrated-business-intelligence-big-data-stacks-harish-butani>
to quantify query acceleration and cost.
- We haven't looked at integration with Spark Streaming yet, but since we have
a FileFormat it should be possible to integrate. Please ping me if this is of
interest.

regards,
Harish.


On Mon, Aug 29, 2016 at 7:19 PM, Chanh Le  wrote:

> Hi everyone,
>
> Seems a lot of people are using Druid for realtime dashboards.
> I’m just wondering about using Druid as the main storage engine, because Druid
> can store the raw data and can also integrate with Spark (in theory).
> In that case, do we need two separate storage layers, Druid (storing segments
> in HDFS) and HDFS?
> BTW did anyone try this one https://github.com/SparklineData/spark-druid-olap?
>
>
> Regards,
> Chanh
>
>
> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh 
> wrote:
>
> Thanks Bhaarat and everyone.
>
> This is an updated version of the same diagram
>
> The frequency of recent data is defined by the window length in Spark
> Streaming. It can vary from 0.5 seconds to an hour (I don't think we can
> push Spark granularity below 0.5 seconds in anger). For some applications,
> like credit card transactions and fraud detection, data is stored in real
> time by Spark in HBase tables. HBase tables will be on HDFS as well. The
> same Spark Streaming will write asynchronously to HDFS Hive tables.
> One school of thought is to never write to Hive from Spark, but to write
> straight to HBase and then read HBase tables into Hive periodically.
>
> Now the third component in this layer is the Serving Layer, which can combine
> data from the current (HBase) and the historical (Hive tables) to give the
> user visual analytics. That visual analytics can be a real-time dashboard
> on top of the Serving Layer. The Serving Layer could be an in-memory NoSQL
> offering, or data from HBase (red box) combined with Hive tables.
>
> I am not aware of any industrial-strength real-time dashboard. The idea
> is that one uses such a dashboard in real time. Dashboard in this sense
> means a general-purpose API to a data store of some type, like the Serving
> Layer, providing visual analytics in real time on demand, combining real-time
> data and aggregate views. As usual, the devil is in the detail.
>
>
>
> Let me know your thoughts. Anyway, this is a first-cut pattern.
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 August 2016 at 18:53, Bhaarat Sharma  wrote:
>
>> Hi Mich
>>
>> This is really helpful. I'm trying to wrap my head around the last
>> diagram you shared (the one with Kafka). In this diagram Spark Streaming is
>> pushing data to HDFS and NoSQL. However, I'm confused by the "Real Time
>> Queries, Dashboards" annotation. Based on this diagram, will real-time
>> queries be running on Spark or HBase?
>>
>> PS: My intention was not to steer the conversation away from what Ashok
>> asked but I found the diagrams shared by Mich very insightful.
>>
>> On Sun, Aug 28, 2016 at 7:18 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> In terms of positioning, Spark is really the first Big Data platform to
>>> integrate batch, streaming and interactive computations in a unified
>>> framework. What this boils down to is that whichever way one looks
>>> at it, there is somewhere that Spark can make a contribution. In general,
>>> there are a few design patterns common to Big Data
>>>
>>>
>>>
>>>- *ET

Re: Spark + Druid

2015-09-18 Thread Harish Butani
Hi,

I have just posted a Blog on this:
https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani

regards,
Harish Butani.

On Tue, Sep 1, 2015 at 11:46 PM, Paolo Platter 
wrote:

> Fantastic!!! I will look into that and I hope to contribute
>
> Paolo
>
> Inviata dal mio Windows Phone
> ------
> Da: Harish Butani 
> Inviato: ‎02/‎09/‎2015 06:04
> A: user 
> Oggetto: Spark + Druid
>
> Hi,
>
> I am working on the Spark Druid Package:
> https://github.com/SparklineData/spark-druid-olap.
> For scenarios where a 'raw event' dataset is being indexed in Druid, it
> enables you to write your Logical Plans (queries/dataflows) against the 'raw
> event' dataset, and it rewrites parts of the plan to execute as a Druid
> Query. In Spark, the configuration of a Druid DataSource is somewhat like
> configuring an OLAP index in a traditional DB. Early results show a
> significant speedup from pushing slice-and-dice queries to Druid.
>
> It comprises a Druid DataSource that wraps the 'raw event' dataset and
> has knowledge of the Druid Index, and a DruidPlanner, which is a set of plan
> rewrite strategies to convert Aggregation queries into a Plan containing a
> DruidRDD.
>
> Here
> <https://github.com/SparklineData/spark-druid-olap/blob/master/docs/SparkDruid.pdf>
> is a detailed design document, which also describes a benchmark of
> representative queries on the TPCH dataset.
>
> Looking for folks who would be willing to try this out and/or contribute.
>
> regards,
> Harish Butani.
>


Spark + Druid

2015-09-01 Thread Harish Butani
Hi,

I am working on the Spark Druid Package:
https://github.com/SparklineData/spark-druid-olap.
For scenarios where a 'raw event' dataset is being indexed in Druid, it
enables you to write your Logical Plans (queries/dataflows) against the 'raw
event' dataset, and it rewrites parts of the plan to execute as a Druid
Query. In Spark, the configuration of a Druid DataSource is somewhat like
configuring an OLAP index in a traditional DB. Early results show a
significant speedup from pushing slice-and-dice queries to Druid.

It comprises a Druid DataSource that wraps the 'raw event' dataset and
has knowledge of the Druid Index, and a DruidPlanner, which is a set of plan
rewrite strategies to convert Aggregation queries into a Plan containing a
DruidRDD.
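
To make the rewrite concrete, here is a hedged example of the kind of slice-and-dice aggregation it targets, written as ordinary Spark SQL over hypothetical table and column names (a SQLContext `sqlContext` is assumed):

// Ordinary Spark SQL over the 'raw event' dataset. With the Druid DataSource
// and DruidPlanner rules in place, an aggregation of this shape is the
// candidate to be rewritten to run as a Druid query rather than scanning the
// raw events in Spark.
sqlContext.sql("""
  select country, browser, count(*) as events, sum(revenue) as revenue
  from   raw_events
  where  event_date >= '2015-09-01'
  group by country, browser
""").show()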

Here
<https://github.com/SparklineData/spark-druid-olap/blob/master/docs/SparkDruid.pdf>
is a detailed design document, which also describes a benchmark of
representative queries on the TPCH dataset.

Looking for folks who would be willing to try this out and/or contribute.

regards,
Harish Butani.


Re: Data frames select and where clause dependency

2015-07-20 Thread Harish Butani
Yes, via org.apache.spark.sql.catalyst.optimizer.ColumnPruning.
See DefaultOptimizer.batches for the list of logical rewrites.

You can see the optimized plan by printing: df.queryExecution.optimizedPlan
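
For example, with the column names from this thread (assuming a DataFrame `df` that has `field1`, `filter_field` and other columns):

// After ColumnPruning, the scan under this plan should read only
// filter_field and field1, even though df may have many more columns.
val pruned = df.filter(df("filter_field") === "value").select("field1")
println(pruned.queryExecution.optimizedPlan)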

On Mon, Jul 20, 2015 at 5:22 PM, Mohammed Guller 
wrote:

>  Michael,
>
> How would the Catalyst optimizer optimize this version?
>
> df.filter(df("filter_field") === "value").select("field1").show()
>
> Would it still read all the columns in df or would it read only
> “filter_field” and “field1” since only two columns are used (assuming other
> columns from df are not used anywhere else)?
>
>
>
> Mohammed
>
>
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* Friday, July 17, 2015 1:39 PM
> *To:* Mike Trienis
> *Cc:* user@spark.apache.org
> *Subject:* Re: Data frames select and where clause dependency
>
>
>
> Each operation on a dataframe is completely independent and doesn't know
> what operations happened before it.  When you do a selection, you are
> removing other columns from the dataframe and so the filter has nothing to
> operate on.
>
>
>
> On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis 
> wrote:
>
> I'd like to understand why the where field must exist in the select
> clause.
>
>
>
> For example, the following select statement works fine
>
>- df.select("field1", "filter_field").filter(df("filter_field") ===
>"value").show()
>
>  However, the next one fails with the error "in operator !Filter
> (filter_field#60 = value);"
>
>- df.select("field1").filter(df("filter_field") === "value").show()
>
>  As a work-around, it seems that I can do the following
>
>- df.select("field1", "filter_field").filter(df("filter_field") ===
>"value").drop("filter_field").show()
>
>
>
> Thanks, Mike.
>
>
>


Re: Joda Time best practice?

2015-07-20 Thread Harish Butani
Can you post details on how to reproduce the NPE?

On Mon, Jul 20, 2015 at 1:19 PM, algermissen1971  wrote:

> Hi Harish,
>
> On 20 Jul 2015, at 20:37, Harish Butani  wrote:
>
> > Hey Jan,
> >
> > Can you provide more details on the serialization and cache issues.
>
> My symptom is that I have a Joda DateTime on which I can call toString and
> getMillis without problems, but when I call getYear I get an NPE out of the
> internal AbstractDateTime. Totally strange, but it seems to align with issues
> others have seen.
>
> I am now changing the app to work with millis internally, as that seems to
> be a performance improvement regarding serialization anyhow.
>
> Thanks,
>
> Jan
>
>
> >
> > If you are looking for datetime functionality with spark-sql please
> consider:  https://github.com/SparklineData/spark-datetime It provides a
> simple way to combine joda datetime expressions with spark sql.
> >
> > regards,
> > Harish.
> >
> > On Mon, Jul 20, 2015 at 7:37 AM, algermissen1971 <
> algermissen1...@icloud.com> wrote:
> > Hi,
> >
> > I am having trouble with Joda Time in a Spark application and saw by now
> that I am not the only one (generally seems to have to do with
> serialization and internal caches of the Joda Time objects).
> >
> > Is there a known best practice to work around these issues?
> >
> > Jan
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> >
>
>


Re: Joda Time best practice?

2015-07-20 Thread Harish Butani
Hey Jan,

Can you provide more details on the serialization and cache issues?

If you are looking for datetime functionality with spark-sql please
consider:  https://github.com/SparklineData/spark-datetime It provides a
simple way to combine joda datetime expressions with spark sql.
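
If the trouble is Joda objects being serialized across the cluster, a common workaround is to carry epoch millis in your data and construct the DateTime locally where you need calendar fields. A minimal sketch, with an illustrative case class and assuming an existing `events` RDD[Event] (or Dataset[Event]):

import org.joda.time.DateTime

// Keep only epoch millis in the rows that Spark serializes and shuffles...
case class Event(id: String, tsMillis: Long)

// ...and rebuild a DateTime locally, inside the closure, only where calendar
// fields such as the year are needed.
val years = events.map(e => new DateTime(e.tsMillis).getYear)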

regards,
Harish.

On Mon, Jul 20, 2015 at 7:37 AM, algermissen1971  wrote:

> Hi,
>
> I am having trouble with Joda Time in a Spark application and saw by now
> that I am not the only one (generally seems to have to do with
> serialization and internal caches of the Joda Time objects).
>
> Is there a known best practice to work around these issues?
>
> Jan
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: [SPARK-SQL] Window Functions optimization

2015-07-13 Thread Harish Butani
Just once.
You can see this by printing the logical plan: the three window expressions
are grouped into a single Window operator, so the data is partitioned only once.

So do:
val df = sql("your sql...")
println(df.queryExecution.analyzed)
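
For example, with the exact query from the question below, a sketch (assuming `sql` refers to `sqlContext.sql`, as above):

// The full query from this thread; table and column names as in the question.
val df = sql("""
  select key,
         max(value1) over(partition by key) as m1,
         max(value2) over(partition by key) as m2,
         max(value3) over(partition by key) as m3
  from table
""")
// Both the analyzed and the optimized logical plan show a single Window
// operator covering the three max expressions, i.e. one partitioning by key.
println(df.queryExecution.optimizedPlan)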

On Mon, Jul 13, 2015 at 6:37 AM, Hao Ren  wrote:

> Hi,
>
> I would like to know: has any optimization been done for window
> functions in Spark SQL?
>
> For example.
>
> select key,
> max(value1) over(partition by key) as m1,
> max(value2) over(partition by key) as m2,
> max(value3) over(partition by key) as m3
> from table
>
> The query above creates 3 fields based on the same partition rule.
>
> The question is:
> Will Spark SQL partition the table 3 times in the same way to get the three
> max values, or just partition once if it finds the partition rule is the
> same?
>
> It would be nice if someone could point out some lines of code on it.
>
> Thank you.
> Hao
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-SQL-Window-Functions-optimization-tp23796.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark query

2015-07-08 Thread Harish Butani
Try the spark-datetime package:
https://github.com/SparklineData/spark-datetime
Follow this example
https://github.com/SparklineData/spark-datetime#a-basic-example to get the
different attributes of a DateTime.
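
Alternatively, a hedged sketch of doing it with a small Joda-backed UDF (assuming a SQLContext `sqlContext`, Joda Time on the classpath, and hypothetical table/column names):

import org.joda.time.DateTime

// Register a UDF that parses an ISO-formatted date/timestamp string and
// returns its day of year.
sqlContext.udf.register("day_of_year", (s: String) => DateTime.parse(s).getDayOfYear)

sqlContext.sql("select day_of_year(order_date) as doy from orders").show()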

On Wed, Jul 8, 2015 at 9:11 PM, prosp4300  wrote:

> As mentioned in the Spark SQL programming guide, Spark SQL supports Hive UDFs.
> Please take a look at the built-in UDFs of Hive below; getting the day of year
> should be as simple as in an existing RDBMS:
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
>
>
> At 2015-07-09 12:02:44, "Ravisankar Mani"  wrote:
>
> Hi everyone,
>
> I can't get 'day of year' when using a Spark query. Can you suggest any way to
> achieve day of year?
>
> Regards,
> Ravi
>
>
>
>