unsubscribe

2023-03-06 Thread Shiv Prashant Sood
On Sun, 5 Mar 2023 at 18:27, zhangliyun  wrote:

> Hi all
>
>
> I have a Spark SQL query that ran correctly on Spark 2.4.2, but after
> upgrading to Spark 3.1.3 it behaves differently.
>
> The query:
>
>  ```
>
> select * from eds_rds.cdh_prpc63cgudba_pp_index_disputecasedetails_hourly
> where dt >= date_sub('${today}',30);
>
>
> ```
>
> It should load the last 30 days of data from table
> eds_rds.cdh_prpc63cgudba_pp_index_disputecasedetails_hourly, where
> today='2023-03-01'.
>
>
> In Spark 2 the physical plan shows the partition filter as:
> PartitionFilters: [isnotnull(dt#1461), (dt#1461 >= 2023-01-31)]
>
> ```
>  +- *(4) FileScan parquet 
> eds_rds.cdh_prpc63cgudba_pp_index_disputecasedetails_hourly[disputeid#1327,statuswork#1330,opTs#1457,trailSeqno#1459,trailRba#1460,dt#1461,hr#1462]
>  Batched: true, Format: Parquet, Location: 
> PrunedInMemoryFileIndex[gs://pypl-bkt-prd-row-std-gds-non-edw-tables/apps/risk/eds/eds_risk/eds_r...,
>  PartitionCount: 805, PartitionFilters: [isnotnull(dt#1461), (dt#1461 >= 
> 2023-01-31)], PushedFilters: [IsNotNull(disputeid)], ReadSchema: 
> struct
> ```
>
>
>
> In Spark 3 the physical plan shows the partition filter as
> [isnotnull(dt#1602), (cast(dt#1602 as date) >= 19387)]:
> ```
>
> (8) Scan parquet eds_rds.cdh_prpc63cgudba_pp_index_disputecasedetails_hourly
> Output [7]: [disputeid#1468, statuswork#1471, opTs#1598, trailSeqno#1600, 
> trailRba#1601, dt#1602, hr#1603]
> Batched: true
> Location: InMemoryFileIndex 
> [gs://pypl-bkt-prd-row-std-gds-non-edw-tables/apps/risk/eds/eds_risk/eds_rds/cdh/prpc63cgudba_pp_index_disputecasedetails/dt=2023-01-30/hr=00,
>  ... 784 entries]
> PartitionFilters: [isnotnull(dt#1602), (cast(dt#1602 as date) >= 19387)]
> PushedFilters: [IsNotNull(disputeid)]
> ReadSchema: 
> struct
>
>
> ```
>
> My question is: why is there such a big difference in the partition filter
> between Spark 2 and Spark 3? Most of my Spark configuration is the same in
> Spark 2 and Spark 3, and I am running the same SQL.
>
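A note on the literal 19387 (an editorial explanation, not confirmed in the thread): Spark stores DateType values internally as a count of days since the Unix epoch, and in Spark 3 the plan prints the date literal in that internal form because the string partition column dt is cast to date before comparison. 19387 decodes to 2023-01-30, which is exactly date_sub('2023-03-01', 30), so both filters prune by the same kind of cutoff; the Spark 2 plan's 2023-01-31 likely came from a run with a different `${today}` value. A quick check of the encoding:

```python
from datetime import date

# Spark stores DateType values internally as a count of days since the
# Unix epoch (1970-01-01); the literal 19387 in the Spark 3 plan is that
# internal encoding of a date.
EPOCH = date(1970, 1, 1)

def to_epoch_days(d: date) -> int:
    """Encode a calendar date the way Spark's DateType stores it."""
    return (d - EPOCH).days

def from_epoch_days(n: int) -> date:
    """Decode Spark's internal day count back to a calendar date."""
    return date.fromordinal(EPOCH.toordinal() + n)

print(from_epoch_days(19387))            # -> 2023-01-30
print(to_epoch_days(date(2023, 1, 30)))  # -> 19387
```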


Re: DataSourceV2 : Transactional Write support

2019-08-05 Thread Shiv Prashant Sood
Thanks all for the clarification.

Regards,
Shiv

On Sat, Aug 3, 2019 at 12:49 PM Ryan Blue  wrote:

> > What you could try instead is intermediate output: inserting into a
> temporary table in executors, and moving inserted records to the final
> table in the driver (must be atomic)
>
> I think that this is the approach that other systems (maybe Sqoop?) have
> taken. Insert into independent temporary tables, which can be done quickly.
> Then for the final commit operation, union and insert into the final table.
> In a lot of cases, JDBC databases can do that quickly as well because the
> data is already on disk and just needs to be added to the final table.
>
> On Fri, Aug 2, 2019 at 7:25 PM Jungtaek Lim  wrote:
>
>> I asked a similar question about end-to-end exactly-once with Kafka, and
>> you're correct that distributed transactions are not supported. Introducing
>> distributed transactions like "two-phase commit" would require huge changes
>> to the Spark codebase, and the feedback was not positive.
>>
>> What you could try instead is intermediate output: inserting into a
>> temporary table in executors, and moving the inserted records to the final
>> table in the driver (must be atomic).
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Sat, Aug 3, 2019 at 4:56 AM Shiv Prashant Sood 
>> wrote:
>>
>>> All,
>>>
>>> I understood that DataSourceV2 supports transactional writes and wanted
>>> to implement that in the JDBC DataSource V2 connector ( PR#25211
>>> <https://github.com/apache/spark/pull/25211> ).
>>>
>>> I don't see how this is feasible for a JDBC-based connector. The framework
>>> suggests that each EXECUTOR send a commit message to the DRIVER, and the
>>> actual commit should only be done by the DRIVER after receiving all commit
>>> confirmations. This will not work for JDBC, as commits have to happen on
>>> the JDBC Connection, which is maintained by the EXECUTORS, and a
>>> JDBCConnection is not serializable, so it cannot be sent to the DRIVER.
>>>
>>> Am I right in thinking that this cannot be supported for JDBC? My goal
>>> is to either fully write or roll back the dataframe write operation.
>>>
>>> Thanks in advance for your help.
>>>
>>> Regards,
>>> Shiv
>>>
>>
>>
>> --
>> Name : Jungtaek Lim
>> Blog : http://medium.com/@heartsavior
>> Twitter : http://twitter.com/heartsavior
>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
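The staging-table approach described above can be sketched with SQLite standing in for the JDBC target. This is an illustrative simulation (the table names and the executor/driver split are invented for the example; real connector code would run phase 1 on separate machines):

```python
import sqlite3

# In-memory SQLite stands in for the JDBC target database. Each simulated
# "executor" task writes to its own staging table; the "driver" then moves
# all staged rows into the final table in a single transaction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE final (id INTEGER, val TEXT)")

partitions = {0: [(1, "a"), (2, "b")], 1: [(3, "c")]}

# Phase 1 (executors): insert into per-task staging tables and report the
# staging table name back to the driver as the "commit message".
commit_messages = []
for task_id, rows in partitions.items():
    staging = f"staging_{task_id}"
    conn.execute(f"CREATE TABLE {staging} (id INTEGER, val TEXT)")
    conn.executemany(f"INSERT INTO {staging} VALUES (?, ?)", rows)
    commit_messages.append(staging)

# Phase 2 (driver): INSERT ... SELECT from every staging table inside one
# transaction. If any statement fails, the whole move rolls back and the
# final table is left untouched.
with conn:  # the connection context manager commits or rolls back as a unit
    for staging in commit_messages:
        conn.execute(f"INSERT INTO final SELECT * FROM {staging}")
for staging in commit_messages:
    conn.execute(f"DROP TABLE {staging}")

print(conn.execute("SELECT COUNT(*) FROM final").fetchone()[0])  # -> 3
```

The final move is fast for the reason given above: the rows are already on disk in the same database, so the commit is a metadata-and-copy operation local to the target, not a re-upload.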


DataSourceV2 : Transactional Write support

2019-08-02 Thread Shiv Prashant Sood
 All,

I understood that DataSourceV2 supports transactional writes and wanted to
implement that in the JDBC DataSource V2 connector ( PR#25211
 ).

I don't see how this is feasible for a JDBC-based connector. The framework
suggests that each EXECUTOR send a commit message to the DRIVER, and the
actual commit should only be done by the DRIVER after receiving all commit
confirmations. This will not work for JDBC, as commits have to happen on the
JDBC Connection, which is maintained by the EXECUTORS, and a JDBCConnection
is not serializable, so it cannot be sent to the DRIVER.

Am I right in thinking that this cannot be supported for JDBC? My goal is
to either fully write or roll back the dataframe write operation.

Thanks in advance for your help.

Regards,
Shiv
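For context, the commit protocol the framework assumes (tasks return commit messages; the driver publishes or aborts everything) works naturally when the final "commit" is an atomic, driver-side operation such as a file rename. A minimal file-based sketch, with invented names, illustrating why the pattern gives all-or-nothing semantics:

```python
import os
import tempfile

def task_write(outdir: str, task_id: int, rows: list) -> str:
    """'Executor' side: write output to a staging file and return a
    commit message (here simply the staging path)."""
    staging = os.path.join(outdir, f"_staging_{task_id}")
    with open(staging, "w") as f:
        f.writelines(line + "\n" for line in rows)
    return staging

def driver_commit(commit_messages: list) -> None:
    """'Driver' side: publish each staging file via an atomic rename.
    Nothing becomes visible under a final name until every task succeeded."""
    for staging in commit_messages:
        os.rename(staging, staging.replace("_staging_", "part-"))

def driver_abort(commit_messages: list) -> None:
    """Roll back by discarding all staged output."""
    for staging in commit_messages:
        os.remove(staging)

outdir = tempfile.mkdtemp()
msgs = [task_write(outdir, 0, ["a", "b"]), task_write(outdir, 1, ["c"])]
driver_commit(msgs)
print(sorted(os.listdir(outdir)))  # -> ['part-0', 'part-1']
```

The catch for JDBC, as the message above explains, is that the publishable unit lives inside an executor-held Connection that cannot be shipped to the driver; staging tables plus a final driver-side INSERT ... SELECT recreate the same two-phase shape using objects the target database can move atomically itself.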


Re: JDBC connector for DataSourceV2

2019-07-15 Thread Shiv Prashant Sood
Agree. Let's use SPARK-24907
<https://issues.apache.org/jira/browse/SPARK-24907> as the JIRA for this
work. Thanks for resolving SPARK-28380
<https://issues.apache.org/jira/browse/SPARK-28380> as dupe of this.

Regards,
Shiv

On Mon, Jul 15, 2019 at 1:50 AM Gabor Somogyi 
wrote:

> I've had a look at the JIRAs and it seems the intention is the same
> (correct me if I'm wrong).
> I think one is enough and the rest can be closed as duplicates.
> We should keep multiple JIRAs only when the intention is different.
>
> BR,
> G
>
>
> On Mon, Jul 15, 2019 at 6:01 AM Xianyin Xin 
> wrote:
>
>> There’s another PR, https://github.com/apache/spark/pull/21861, but it
>> is based on the old V2 APIs.
>>
>>
>>
>> We’d better link the JIRAs, SPARK-24907
>> <https://issues.apache.org/jira/browse/SPARK-24907>, SPARK-25547
>> <https://issues.apache.org/jira/browse/SPARK-25547>, and SPARK-28380
>> <https://issues.apache.org/jira/browse/SPARK-28380> and finalize a plan.
>>
>>
>>
>> Xianyin
>>
>>
>>
>> *From: *Shiv Prashant Sood 
>> *Date: *Sunday, July 14, 2019 at 2:59 AM
>> *To: *Gabor Somogyi 
>> *Cc: *Xianyin Xin , Ryan Blue <
>> rb...@netflix.com>, , Spark Dev List <
>> dev@spark.apache.org>
>> *Subject: *Re: JDBC connector for DataSourceV2
>>
>>
>>
>> To me this looks like a refactoring of the DS1 JDBC source to enable
>> user-provided connection factories. In itself a good change, but IMO not
>> DSV2-related.
>>
>>
>>
>> I created a JIRA and added some goals. Please comment/add as relevant.
>>
>>
>>
>> https://issues.apache.org/jira/browse/SPARK-28380
>>
>>
>>
>> JIRA for DataSourceV2 API based JDBC connector.
>>
>> Goals :
>>
>>    - Generic connector based on JDBC that supports all databases (the
>>    minimum bar is support for all V1 databases).
>>- Reference implementation and Interface for any specialized JDBC
>>connectors.
>>
>>
>>
>> Regards,
>>
>> Shiv
>>
>>
>>
>> On Sat, Jul 13, 2019 at 2:17 AM Gabor Somogyi 
>> wrote:
>>
>> Hi Guys,
>>
>>
>>
>> I don't know what the exact intention is here, but there is such a PR:
>> https://github.com/apache/spark/pull/22560
>>
>> If that's what we need, maybe we can resurrect it. BTW, I'm also
>> interested in...
>>
>>
>>
>> BR,
>>
>> G
>>
>>
>>
>>
>>
>> On Sat, Jul 13, 2019 at 4:09 AM Shiv Prashant Sood <
>> shivprash...@gmail.com> wrote:
>>
>> Thanks all. I can also contribute toward this effort.
>>
>>
>>
>> Regards,
>>
>> Shiv
>>
>> Sent from my iPhone
>>
>>
>> On Jul 12, 2019, at 6:51 PM, Xianyin Xin 
>> wrote:
>>
>> If there’s nobody working on that, I’d like to contribute.
>>
>>
>>
>> Loop in @Gengliang Wang.
>>
>>
>>
>> Xianyin
>>
>>
>>
>> *From: *Ryan Blue 
>> *Reply-To: *
>> *Date: *Saturday, July 13, 2019 at 6:54 AM
>> *To: *Shiv Prashant Sood 
>> *Cc: *Spark Dev List 
>> *Subject: *Re: JDBC connector for DataSourceV2
>>
>>
>>
>> I'm not aware of a JDBC connector effort. It would be great to have
>> someone build one!
>>
>>
>>
>> On Fri, Jul 12, 2019 at 3:33 PM Shiv Prashant Sood <
>> shivprash...@gmail.com> wrote:
>>
>> Can someone please help me understand the current status of the DataSource
>> V2-based JDBC connector? I see connectors for various file formats in
>> master, but I can't find a JDBC implementation or related JIRA.
>>
>>
>>
>> The DataSourceV2 APIs look to be in good shape for attempting a JDBC
>> connector for the READ/WRITE path.
>>
>> Thanks & Regards,
>>
>> Shiv
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>


Re: JDBC connector for DataSourceV2

2019-07-13 Thread Shiv Prashant Sood
To me this looks like a refactoring of the DS1 JDBC source to enable
user-provided connection factories. In itself a good change, but IMO not
DSV2-related.

I created a JIRA and added some goals. Please comment/add as relevant.

https://issues.apache.org/jira/browse/SPARK-28380

JIRA for DataSourceV2 API based JDBC connector.

Goals :

   - Generic connector based on JDBC that supports all databases (the
   minimum bar is support for all V1 databases).
   - Reference implementation and Interface for any specialized JDBC
   connectors.


Regards,
Shiv

On Sat, Jul 13, 2019 at 2:17 AM Gabor Somogyi 
wrote:

> Hi Guys,
>
> I don't know what the exact intention is here, but there is such a PR:
> https://github.com/apache/spark/pull/22560
> If that's what we need, maybe we can resurrect it. BTW, I'm also interested
> in...
>
> BR,
> G
>
>
> On Sat, Jul 13, 2019 at 4:09 AM Shiv Prashant Sood 
> wrote:
>
>> Thanks all. I can also contribute toward this effort.
>>
>> Regards,
>> Shiv
>>
>> Sent from my iPhone
>>
>> On Jul 12, 2019, at 6:51 PM, Xianyin Xin 
>> wrote:
>>
>> If there’s nobody working on that, I’d like to contribute.
>>
>>
>>
>> Loop in @Gengliang Wang.
>>
>>
>>
>> Xianyin
>>
>>
>>
>> *From: *Ryan Blue 
>> *Reply-To: *
>> *Date: *Saturday, July 13, 2019 at 6:54 AM
>> *To: *Shiv Prashant Sood 
>> *Cc: *Spark Dev List 
>> *Subject: *Re: JDBC connector for DataSourceV2
>>
>>
>>
>> I'm not aware of a JDBC connector effort. It would be great to have
>> someone build one!
>>
>>
>>
>> On Fri, Jul 12, 2019 at 3:33 PM Shiv Prashant Sood <
>> shivprash...@gmail.com> wrote:
>>
>> Can someone please help me understand the current status of the DataSource
>> V2-based JDBC connector? I see connectors for various file formats in
>> master, but I can't find a JDBC implementation or related JIRA.
>>
>>
>>
>> The DataSourceV2 APIs look to be in good shape for attempting a JDBC
>> connector for the READ/WRITE path.
>>
>> Thanks & Regards,
>>
>> Shiv
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>


Re: JDBC connector for DataSourceV2

2019-07-12 Thread Shiv Prashant Sood
Thanks all. I can also contribute toward this effort.

Regards,
Shiv

Sent from my iPhone

> On Jul 12, 2019, at 6:51 PM, Xianyin Xin  wrote:
> 
> If there’s nobody working on that, I’d like to contribute.
>  
> Loop in @Gengliang Wang.
>  
> Xianyin
>  
> From: Ryan Blue 
> Reply-To: 
> Date: Saturday, July 13, 2019 at 6:54 AM
> To: Shiv Prashant Sood 
> Cc: Spark Dev List 
> Subject: Re: JDBC connector for DataSourceV2
>  
> I'm not aware of a JDBC connector effort. It would be great to have someone 
> build one!
>  
> On Fri, Jul 12, 2019 at 3:33 PM Shiv Prashant Sood  
> wrote:
> Can someone please help me understand the current status of the DataSource
> V2-based JDBC connector? I see connectors for various file formats in
> master, but I can't find a JDBC implementation or related JIRA.
>
> The DataSourceV2 APIs look to be in good shape for attempting a JDBC
> connector for the READ/WRITE path.
> 
> Thanks & Regards,
> Shiv
> 
>  
> --
> Ryan Blue
> Software Engineer
> Netflix


JDBC connector for DataSourceV2

2019-07-12 Thread Shiv Prashant Sood
Can someone please help me understand the current status of the DataSource
V2-based JDBC connector? I see connectors for various file formats in
master, but I can't find a JDBC implementation or related JIRA.

The DataSourceV2 APIs look to be in good shape for attempting a JDBC
connector for the READ/WRITE path.

Thanks & Regards,
Shiv