Re: DataSourceV2 sync notes (#4)

2018-12-18 Thread Srabasti Banerjee
 Thanks for sending out the meeting notes from last week's discussion, Ryan!
For unknown technical reasons I could not unmute myself and be heard when I 
tried to pitch in during the discussion on default value handling for 
traditional databases, so I had posted my response in the chat.

My two cents on how traditional databases handle default values: from my 
industry experience, Oracle has a constraint clause, "ENABLE NOVALIDATE", that 
lets new rows added going forward pick up the default value, while existing 
rows are not required to be updated with it. One can still choose to backfill 
the older data with a data fix at any point.

Happy Holidays All in advance :-)

Warm Regards,
Srabasti Banerjee
On Tuesday, 18 December, 2018, 4:15:06 PM GMT-8, Ryan Blue 
 wrote:  
 
 
Hi everyone, sorry these notes are late. I didn’t have the time to write this 
up last week.

For anyone interested in the next sync, we decided to skip next week and resume 
in early January. I’ve already sent the invite. As usual, if you have topics 
you’d like to discuss or would like to be added to the invite list, just let me 
know. Everyone is welcome.

rb

Attendees:
Ryan Blue
Xiao Li
Bruce Robbins
John Zhuge
Anton Okolnychyi
Jackey Lee
Jamison Bennett
Srabasti Banerjee
Thomas D’Silva
Wenchen Fan
Matt Cheah
Maryann Xue
(possibly others that entered after the start)

Agenda:
   
   - Current discussions from the v2 batch write PR: WriteBuilder and SaveMode
   - Continue sql-api discussion after looking at API dependencies
   - Capabilities API
   - Overview of TableCatalog proposal to sync understanding (if time)

Notes:
   
   - WriteBuilder:
  - Wenchen summarized the options (factory methods vs. builder) and some 
trade-offs
  - What we need to accomplish now can be done with factory methods, which 
are simpler
  - A builder matches the structure of the read side
  - Ryan's opinion is to use the builder for consistency and evolution: a 
builder makes it easier to change or remove parts without copying all of the 
arguments of a method
  - Matt's opinion is that a builder makes evolution and maintenance easier 
and that matching the read side is good
  - Consensus was to use WriteBuilder instead of factory methods (see the 
sketch below)
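A rough sketch of the trade-off (illustrative Scala only; the names are 
hypothetical and not the actual proposed interfaces):

import org.apache.spark.sql.types.StructType

// Placeholder result type, just for the sketch.
trait BatchWrite

// Factory-method style: adding a parameter later (options, a query id, ...)
// means another overload or a breaking signature change for every source.
trait FactoryStyleWriteSupport {
  def createBatchWrite(queryId: String, schema: StructType): BatchWrite
}

// Builder style: new hooks arrive as defaulted methods on the builder, so
// existing implementations keep compiling and only override what they need.
trait WriteBuilder {
  def withQueryId(id: String): WriteBuilder = this
  def withInputSchema(schema: StructType): WriteBuilder = this
  def buildForBatch(): BatchWrite
}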

   - SaveMode:  
  - Context: v1 passes SaveMode from the DataFrameWriter API to sources. 
The action taken for a given mode and existing table state depends on the 
source implementation, which is something the community wants to fix in v2. 
But v2 initially passed SaveMode to sources, so the question is how and when 
to remove SaveMode.
  - Wenchen: the current API uses SaveMode and we don’t want to drop 
features
  - Ryan: The main requirement is removing this before the next release. We 
should not ship a substantial API change without removing it, because removing 
it later would require yet another API change.
  - Xiao: suggested creating a release-blocking issue.
  - Consensus was to remove SaveMode before the next release, blocking if 
needed.
  - Someone also stated that keeping SaveMode would make porting file 
sources to v2 easier
  - Ryan disagrees that using SaveMode makes porting file sources faster or 
easier.

   - Capabilities API (this is a quick overview of a long conversation)
  - Context: there are several situations where a source needs to change 
how Spark behaves or Spark needs to check whether a source supports some 
feature. For example: Spark checks whether a source supports batch writes; 
write-only sources that do not need validation need to tell Spark not to run 
validation rules; and sources that can read files with missing columns (e.g., 
Iceberg) need Spark to allow writes that omit columns that are optional or 
have default values.
  - Xiao suggested handling this case by case and the conversation moved to 
discussing the motivating case for Netflix: allowing writes that do not include 
optional columns.
  - Wenchen and Maryann added that Spark should handle all default values 
so that this doesn’t differ across sources. Ryan agreed that would be good, but 
pointed out challenges.
  - There was a long discussion about how Spark could handle default 
values. The difficulty is that adding a column with a default raises the 
question of how older data is read. Maryann and Dilip pointed out that 
traditional databases apply default values at write time, so the correct 
default is the one in effect when a row is written (not when it is read), but 
it is unclear how existing data is handled.
  - Matt and Ryan asked whether databases update existing rows when a 
default is added. But even if a database can update all existing rows, that 
would not be reasonable for Spark, which in the worst case would need to 
update millions of immutable files. It is also not a reasonable requirement to 
put on sources, so Spark would need to have read-side defaults (see the sketch 
at the end of these notes).
  - Xiao noted that it may be easier to trea
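To make the read-side default idea above concrete, here is a minimal 
illustrative sketch (not a proposed API): it emulates a read-side default for a 
hypothetical column, assuming the table schema already includes the column so 
that older files surface it as null.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, lit}

object ReadSideDefaultSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-side-default").getOrCreate()

    // Hypothetical table path; older files were written before the column
    // "priority" (also hypothetical) was added with a default of 0.
    val df = spark.read.parquet("/path/to/table")

    // Fill the default at read time instead of rewriting millions of
    // immutable files: rows from older files come back with null here.
    val withDefault = df.withColumn("priority", coalesce(df("priority"), lit(0)))
    withDefault.show()
  }
}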

Re: DataSourceV2 sync today

2018-11-14 Thread Srabasti Banerjee
 Thanks for the info Ryan :-)
Will join on Hangouts!
Warm Regards,
Srabasti Banerjee

On Wednesday, 14 November, 2018, 4:13:32 PM GMT-8, Ryan Blue 
 wrote:  
 
 Srabasti,
It looks like the live stream only works within the host's domain. Everyone 
should just join the meet/hangout.
On Wed, Nov 14, 2018 at 4:08 PM Srabasti Banerjee  wrote:

 Hi All,
I am trying to view it using Gmail and see the following message below.
Is anyone getting the same error?
Are there any alternate options? Is there a number I can dial in to, or a 
Webex that I can attend?
Thanks for your help in advance :-)
Warm Regards,
Srabasti Banerjee



On Wednesday, 14 November, 2018, 9:44:11 AM GMT-8, Ryan Blue 
 wrote:  
 
 The live stream link for this is 
https://stream.meet.google.com/stream/6be59d80-04c7-44dc-9042-4f3b597fc8ba
Some people said that it didn't work last time. I'm not sure why that would 
happen, but I don't use these much so I'm no expert. If you can't join the live 
stream, then feel free to join the meet up.
I'll also plan on joining earlier than I did last time, in case the 
meet/hangout needs to be up for people to view the live stream.
rb
On Tue, Nov 13, 2018 at 4:00 PM Ryan Blue  wrote:


Hi everyone,
I just wanted to send out a reminder that there’s a DSv2 sync tomorrow at 17:00 
PST, which is 01:00 UTC.

Here are some of the topics under discussion in the last couple of weeks:
   
   - Read API for v2 - see Wenchen’s doc
   - Capabilities API - see the dev list thread
   - Using CatalogTableIdentifier to reliably separate v2 code paths - see PR 
#21978
   - A replacement for InternalRow

I know that a lot of people are also interested in combining the source API for 
micro-batch and continuous streaming. Wenchen and I have been discussing a way 
to do that and Wenchen has added it to the Read API doc as Alternative #2. I 
think this would be a good thing to plan on discussing.

rb

Here’s some additional background on combining micro-batch and continuous APIs:

The basic idea is to update how tasks end so that the same tasks can be used in 
micro-batch or streaming. For tasks that are naturally limited like data files, 
when the data is exhausted, Spark stops reading. For tasks that are not 
limited, like a Kafka partition, Spark decides when to stop in micro-batch mode 
by hitting a pre-determined LocalOffset or Spark can just keep running in 
continuous mode.

Note that a task deciding to stop can happen in both modes, either when a task 
is exhausted in micro-batch or when a stream needs to be reconfigured in 
continuous.

Here’s the task reader API. The offset returned is optional so that a task can 
avoid stopping if there isn’t a resumable offset, like if it is in the middle 
of an input file:
interface StreamPartitionReader<T> extends InputPartitionReader<T> {
  Optional<Offset> currentOffset();
  boolean next();  // from InputPartitionReader
  T get();         // from InputPartitionReader
}

The streaming code would look something like this:
val stream: Stream = scan.toStream()
val factory: StreamReaderFactory = stream.createReaderFactory()

while (true) {
  val start: Offset = stream.currentOffset()
  val end: Option[Offset] = if (isContinuousMode) {
    None
  } else {
    // rate limiting would happen here
    Some(stream.latestOffset())
  }

  val parts: Array[InputPartition] = stream.planInputPartitions(start)

  // returns when needsReconfiguration is true or all tasks finish
  runTasks(parts, factory, end)

  // the stream's current offset has been updated at the last epoch
}
-- 
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 sync today

2018-11-14 Thread Srabasti Banerjee
 Hi All,
I am trying to view it using Gmail and see the following message below.
Is anyone getting the same error?
Are there any alternate options? Is there a number I can dial in to, or a 
Webex that I can attend?
Thanks for your help in advance :-)
Warm Regards,
Srabasti Banerjee




Re: Happy Diwali everyone!!!

2018-11-08 Thread Srabasti Banerjee
 Thoughtful of you to remember Xiao :-)
Wish everyone a Happy & Prosperous Diwali !
Thanks
Srabasti Banerjee

On Wednesday, 7 November, 2018, 3:12:01 PM GMT-8, Dilip Biswal 
 wrote:  
 
 Thank you Sean. Happy Diwali !! -- Dilip
- Original message -
From: Xiao Li 
To: "u...@spark.apache.org" , user 
Cc:
Subject: Happy Diwali everyone!!!
Date: Wed, Nov 7, 2018 3:10 PM
 Happy Diwali everyone!!! Xiao Li
 

Re: Timestamp Difference/operations

2018-10-15 Thread Srabasti Banerjee
Hi Paras,
Check out the link "Spark Scala: DateDiff of two columns by hour or minute". 
The post starts: "I have two timestamp columns in a dataframe that I'd like to 
get the minute difference of, or alternatively, the hour difference of. 
Currently I'm able to get the day difference, with rounding, by ..."
Looks like you can get the difference in seconds as well; a rough sketch is 
below. Hopefully this helps! Are you looking for a specific use case? Can you 
please elaborate with an example?
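A minimal sketch of the seconds/minutes/hours approach (the column and app 
names are hypothetical), casting the timestamps to epoch seconds and doing 
plain arithmetic:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object TimestampDiffSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("timestamp-diff").getOrCreate()
    import spark.implicits._

    // Hypothetical data with two timestamp columns, start_ts and end_ts.
    val df = Seq(("2000-01-01 00:00:00", "2000-02-01 12:34:34"))
      .toDF("start_str", "end_str")
      .select(col("start_str").cast("timestamp").as("start_ts"),
              col("end_str").cast("timestamp").as("end_ts"))

    // Casting a timestamp to long yields epoch seconds, so differences in
    // seconds, minutes, and hours are just arithmetic on the casted columns.
    val secs = col("end_ts").cast("long") - col("start_ts").cast("long")
    df.select(secs.as("diff_seconds"),
              (secs / 60).as("diff_minutes"),
              (secs / 3600).as("diff_hours"))
      .show(truncate = false)
  }
}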
Thanks,
Srabasti Banerjee

Sent from Yahoo Mail on Android 
 
  On Sun, Oct 14, 2018 at 23:41, Paras Agarwal wrote:
Thanks John,

Actually I need the full date and time difference, not just the date 
difference, which I guess is not supported.

Let me know if it's possible, or if any UDF is available for the same.

Thanks and Regards,
Paras

From: John Zhuge 
Sent: Friday, October 12, 2018 9:48:47 PM
To: Paras Agarwal
Cc: user; dev
Subject: Re: Timestamp Difference/operations

Yeah, operator "-" does not seem to be supported; however, you can use the 
"datediff" function:
In [9]: select datediff(CAST('2000-02-01 12:34:34' AS TIMESTAMP), CAST('2000-01-01 00:00:00' AS TIMESTAMP))
Out[9]: datediff(CAST(CAST(2000-02-01 12:34:34 AS TIMESTAMP) AS DATE), CAST(CAST(2000-01-01 00:00:00 AS TIMESTAMP) AS DATE)) = 31

In [10]: select datediff('2000-02-01 12:34:34', '2000-01-01 00:00:00')
Out[10]: datediff(CAST(2000-02-01 12:34:34 AS DATE), CAST(2000-01-01 00:00:00 AS DATE)) = 31

In [11]: select datediff(timestamp '2000-02-01 12:34:34', timestamp '2000-01-01 00:00:00')
Out[11]: datediff(CAST(TIMESTAMP('2000-02-01 12:34:34.0') AS DATE), CAST(TIMESTAMP('2000-01-01 00:00:00.0') AS DATE)) = 31
On Fri, Oct 12, 2018 at 7:01 AM Paras Agarwal  
wrote:


Hello Spark Community,

Currently in Hive we can do operations on timestamps like:
CAST('2000-01-01 12:34:34' AS TIMESTAMP) - CAST('2000-01-01 00:00:00' AS 
TIMESTAMP)

It seems this is not supported in Spark.
Is there any way available?

Kindly provide some insight on this.


Paras
9130006036




-- 
John  


Re: Spark Streaming : Multiple sources found for csv : Error

2018-08-30 Thread Srabasti Banerjee
Hi Jörn,
Do you have suggestions as to how to do that?

The conflicting packages are being picked up by default from pom.xml. I am not 
invoking any additional packages while running spark-submit on the thin jar.

Thanks,
Srabasti Banerjee
   On Thursday, 30 August, 2018, 9:45:36 PM GMT-7, Jörn Franke 
 wrote:  
 
 Can’t you remove the dependency on the Databricks CSV data source? Spark has 
had it integrated for some versions now, so it is not needed.
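A sketch of how dropping it could look, assuming an sbt build (with Maven, the 
equivalent is an <exclusions> block on whichever artifact pulls in spark-csv); 
the artifact names and versions below are placeholders:

// build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.3.1" % "provided",
  // Exclude the external Databricks CSV source wherever it sneaks in
  // transitively, so only Spark's built-in csv format is on the classpath.
  ("com.example" %% "some-internal-lib" % "1.0.0")
    .excludeAll(ExclusionRule(organization = "com.databricks"))
)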

Re: Spark Streaming : Multiple sources found for csv : Error

2018-08-30 Thread Srabasti Banerjee
Great that we are already discussing/working on fixing the issue. Happy to 
help if I can :-)

Any workarounds that we can use for now?
Please note I am not invoking any additional packages while running spark 
submit on the thin jar.
Thanks,
Srabasti Banerjee





   On Thursday, 30 August, 2018, 9:02:11 PM GMT-7, Hyukjin Kwon 
 wrote:  
 
 Yea, this is exactly what I have been worried about with the recent changes 
(discussed in https://issues.apache.org/jira/browse/SPARK-24924). See 
https://github.com/apache/spark/pull/17916. This should be fine in later Spark 
versions.

FYI, +Wenchen and Dongjoon. I want to add Thomas Graves and Gengliang Wang too 
but can't find their email addresses.

Spark Streaming : Multiple sources found for csv : Error

2018-08-30 Thread Srabasti Banerjee
Hi,
I am trying to run the code below to read a file as a DataFrame onto a stream 
(for Spark Streaming), developed via the Eclipse IDE with schemas defined 
appropriately, by running a thin jar on the server, and I am getting the error 
below. I tried suggestions from researching similar 
"spark.read.option.schema.csv" errors on the internet, with no success.
I am thinking this could be a bug, as the changes might not have been made for 
the readStream option? Has anybody encountered a similar issue with Spark 
Streaming?
Looking forward to hearing your response(s)!
Thanks,
Srabasti Banerjee

Error
Exception in thread "main" java.lang.RuntimeException: Multiple sources found 
for csv (com.databricks.spark.csv.DefaultSource15, 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat), please specify 
the fully qualified class name.

Code:
val csvdf = spark.readStream.option("sep", ",").schema(userSchema).csv("server_path") // does not resolve error
val csvdf = spark.readStream.option("sep", ",").schema(userSchema).format("com.databricks.spark.csv").csv("server_path") // does not resolve error
val csvdf = spark.readStream.option("sep", ",").schema(userSchema).csv("server_path") // does not resolve error
val csvdf = spark.readStream.option("sep", ",").schema(userSchema).format("org.apache.spark.sql.execution.datasources.csv").csv("server_path") // does not resolve error
val csvdf = spark.readStream.option("sep", ",").schema(userSchema).format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").csv("server_path") // does not resolve error
val csvdf = spark.readStream.option("sep", ",").schema(userSchema).format("com.databricks.spark.csv.DefaultSource15").csv("server_path") // does not resolve error

Access multiple dictionaries inside list in Scala

2017-04-12 Thread Srabasti Banerjee
Hi All,
Is there a way to access multiple dictionaries with different schema 
structures inside a list in a txt file, individually or in combination as 
needed, from the Spark shell using Scala?
The need is to use information from different combinations of the dictionaries 
for reporting calculations.
Thought of different options:

1) Load the data from the txt file into a DataFrame and then segregate it into 
different DataFrames. This way, all the dictionaries with different schema 
structures inside the list can be accessed as separate DataFrames, and the 
calculations can be done using Spark SQL.

2) Converting the txt file to a Parquet or JSON file might not help much, as 
accessing the different dictionaries individually or in combination would 
still be a challenge.

Hence I am thinking the first option would be better; a rough sketch of it is 
below.
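A minimal sketch of option 1, assuming the txt file holds one JSON dictionary 
per line and each dictionary carries a field (called "type" here, purely 
hypothetical) identifying which schema it follows:

import org.apache.spark.sql.SparkSession

object SplitDictionariesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("split-dictionaries").getOrCreate()
    import spark.implicits._

    // Spark infers one merged schema across all the dictionaries in the file.
    val raw = spark.read.json("dictionaries.txt")

    // Segregate into per-schema DataFrames and expose them to Spark SQL.
    val typeA = raw.filter($"type" === "A")
    val typeB = raw.filter($"type" === "B")
    typeA.createOrReplaceTempView("type_a")
    typeB.createOrReplaceTempView("type_b")

    // Combine information from different dictionaries for reporting
    // (assumes both carry an "id" field to join on -- adjust to your data).
    spark.sql("SELECT a.*, b.* FROM type_a a JOIN type_b b ON a.id = b.id").show()
  }
}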
Any other suggestions?
Thanks
Srabasti


Re: welcoming Burak and Holden as committers

2017-01-24 Thread Srabasti Banerjee
Congratulations Holden & Burak :-)
Thanks
Srabasti
 

On Tuesday, 24 January 2017 10:51 AM, shane knapp  
wrote:
 

 congrats to the both of you!  :)

On Tue, Jan 24, 2017 at 10:13 AM, Reynold Xin  wrote:
> Hi all,
>
> Burak and Holden have recently been elected as Apache Spark committers.
>
> Burak has been very active in a large number of areas in Spark, including
> linear algebra, stats/maths functions in DataFrames, Python/R APIs for
> DataFrames, dstream, and most recently Structured Streaming.
>
> Holden has been a long time Spark contributor and evangelist. She has
> written a few books on Spark, as well as frequent contributions to the
> Python API to improve its usability and performance.
>
> Please join me in welcoming the two!
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org