Re: Feature request: split dataset based on condition

2019-02-04 Thread Thakrar, Jayesh
Just wondering if this is what you are implying, Ryan (example only): val data = (dataset to be partitioned) val splitCondition = s""" CASE WHEN …. THEN …. WHEN …. THEN ….. END partition_condition """ val partitionedData = data.withColumn("partitionColumn",
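
A minimal sketch of that pattern, assuming an existing SparkSession `spark`; the paths, column names and conditions below are made up for illustration and are not from the original message:

    import org.apache.spark.sql.functions.expr

    // Hypothetical input; tag each row with a split label via a SQL CASE expression.
    val data = spark.read.parquet("/path/to/input")

    val splitCondition =
      """CASE WHEN amount > 100 THEN 'large'
        |     WHEN amount > 10  THEN 'medium'
        |     ELSE 'small' END""".stripMargin

    val partitionedData = data.withColumn("partitionColumn", expr(splitCondition))

    // One output directory per split value; each split can also be filtered out separately.
    partitionedData.write.partitionBy("partitionColumn").parquet("/path/to/output")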

Re: DataSourceV2 community sync #3

2018-12-03 Thread Thakrar, Jayesh
Thank you Ryan and Xiao – sharing all this info really gives a very good insight! From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Monday, December 3, 2018 at 12:05 PM To: "Thakrar, Jayesh" Cc: Xiao Li , Spark Dev List Subject: Re: DataSourceV2 community sync #3 Ja

Re: DataSourceV2 community sync #3

2018-12-03 Thread Thakrar, Jayesh
To: "Thakrar, Jayesh" Cc: Ryan Blue , "u...@spark.apache.org" Subject: Re: DataSourceV2 community sync #3 Hi, Jayesh, This is a good question. Spark is a unified analytics engine for various data sources. We are able to get the table schema from the underlying data sources

Re: DataSourceV2 community sync #3

2018-12-01 Thread Thakrar, Jayesh
Just curious about the need for a catalog within Spark. Spark interfaces with other systems – many of which have a catalog of their own (e.g. RDBMSes, HBase, Cassandra, etc.) and some of which don’t (e.g. HDFS, filesystems, etc.). So what is the purpose of having this catalog within Spark for tables

Re: Double pass over ORC data files even after supplying schema and setting inferSchema = false

2018-11-21 Thread Thakrar, Jayesh
Thank you for the quick reply, Dongjoon. This sounds interesting and it might be the resolution for our issue. Let me do some tests and I will update the thread. Thanks, Jayesh From: Dongjoon Hyun Date: Wednesday, November 21, 2018 at 11:46 AM To: "Thakrar, Jayesh" Cc: dev Subject:

Double pass over ORC data files even after supplying schema and setting inferSchema = false

2018-11-21 Thread Thakrar, Jayesh
Hi All, We have some batch processing where we read hundreds of thousands of ORC files. What I found is that this was taking too much time AND that there was a long pause between the point where the read begins in the code and when the executors get into action. That period is about 1.5+ hours where only the
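
For context, a sketch of the kind of read being described, with the schema supplied up front so Spark should not need to sample the files to infer it (schema and path are placeholders, not from the original message; assumes an existing SparkSession `spark`):

    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    // Placeholder schema for illustration only.
    val orcSchema = StructType(Seq(
      StructField("id", LongType),
      StructField("name", StringType)))

    val df = spark.read
      .schema(orcSchema)          // explicit schema, no inference pass expected
      .orc("/path/to/orc/files/")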

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-20 Thread Thakrar, Jayesh
September 19, 2018 at 4:35 PM To: "Thakrar, Jayesh" Cc: "tigerqu...@outlook.com" , Spark Dev List Subject: Re: [Discuss] Datasource v2 support for manipulating partitions What does partition management look like in those systems and what are the options we would standard

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-19 Thread Thakrar, Jayesh
by the platform. Similar to the static/dynamic partition loading in Hive and Oracle. So in short, I agree that partition management should be an optional interface. From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Wednesday, September 19, 2018 at 2:58 PM To: "Thakrar, Jayesh

Re: data source api v2 refactoring

2018-09-19 Thread Thakrar, Jayesh
Thanks for the info Ryan – very helpful! From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Wednesday, September 19, 2018 at 3:17 PM To: "Thakrar, Jayesh" Cc: Wenchen Fan , Hyukjin Kwon , Spark Dev List Subject: Re: data source api v2 refactoring Hi Jayesh, The exis

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-18 Thread Thakrar, Jayesh
Totally agree with you, Dale, that there are situations where, for efficiency, performance and better control/visibility/manageability, we need to expose partition management. So as described, I suggested two things - the ability to do it in the current V2 API form via options and appropriate

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-16 Thread Thakrar, Jayesh
I am not involved with the design or development of the V2 API - so these could be naïve comments/thoughts. Just as Dataset abstracts away from RDD, which otherwise required a little more intimate knowledge of Spark internals, I am guessing the absence of partition operations is either

Re: DataSourceWriter V2 Api questions

2018-09-13 Thread Thakrar, Jayesh
connector may not be good and flexible enough. From: Russell Spitzer Date: Tuesday, September 11, 2018 at 9:58 AM To: "Thakrar, Jayesh" Cc: Arun Mahadevan , Jungtaek Lim , Wenchen Fan , Reynold Xin , Ross Lawley , Ryan Blue , dev , "dbis...@us.ibm.com" Subject: Re: Dat

Re: DataSourceWriter V2 Api questions

2018-09-11 Thread Thakrar, Jayesh
So if Spark and the destination datastore are both non-transactional, you will have to resort to an external mechanism for “transactionality”. Here are some options for both RDBMS and non-transactional datastore destinations. For now, assuming that Spark is used in batch mode (and not streaming
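
The options themselves are cut off above; as one illustration of such an external mechanism (an assumption for illustration, not necessarily what the message goes on to list), a common pattern for an RDBMS destination is to write to a staging table and promote it only after the Spark job succeeds. Table names, URL, credentials and the DataFrame `df` are placeholders:

    import java.sql.DriverManager
    import java.util.Properties

    val jdbcUrl = "jdbc:postgresql://host:5432/db"
    val props = new Properties()
    props.put("user", "user")
    props.put("password", "password")

    // 1. Write the job output to a staging table.
    df.write.mode("overwrite").jdbc(jdbcUrl, "target_table_staging", props)

    // 2. Only after the Spark write succeeds, promote the staging table
    //    in a single database transaction using plain JDBC outside of Spark.
    val conn = DriverManager.getConnection(jdbcUrl, "user", "password")
    try {
      conn.setAutoCommit(false)
      val stmt = conn.createStatement()
      stmt.execute("DROP TABLE IF EXISTS target_table")
      stmt.execute("ALTER TABLE target_table_staging RENAME TO target_table")
      conn.commit()
    } finally {
      conn.close()
    }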

Re: data source api v2 refactoring

2018-09-07 Thread Thakrar, Jayesh
Ryan et al, wondering if the existing Spark-based data sources (e.g. for HDFS, Kafka) have been ported to V2. I remember reading threads where there were discussions about the inefficiency/overhead of converting from Row to InternalRow that was preventing certain porting efforts, etc. I ask

Re: Creating JDBC source table schema(DDL) dynamically

2018-07-12 Thread Thakrar, Jayesh
data. I have lots of tables to transport from one environment to another and I don’t want to create unnecessary load on the DB. On 7/12/18, 10:09 AM, "Thakrar, Jayesh" wrote: One option is to use plain JDBC to interrogate the PostgreSQL catalog for the source tab

Re: Creating JDBC source table schema(DDL) dynamically

2018-07-12 Thread Thakrar, Jayesh
One option is to use plain JDBC to interrogate the PostgreSQL catalog for the source table and generate the DDL to create the destination table. Then, using plain JDBC again, create the table at the destination. See the link below for some pointers…..
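
A rough sketch of that approach (connection details, table names and the type mapping are simplified placeholders, not from the original message): query the PostgreSQL information_schema for the source table's columns and assemble a CREATE TABLE statement for the destination.

    import java.sql.DriverManager
    import scala.collection.mutable.ArrayBuffer

    val conn = DriverManager.getConnection(
      "jdbc:postgresql://source-host:5432/sourcedb", "user", "password")

    val ps = conn.prepareStatement(
      """SELECT column_name, data_type
        |FROM information_schema.columns
        |WHERE table_name = ?
        |ORDER BY ordinal_position""".stripMargin)
    ps.setString(1, "source_table")
    val rs = ps.executeQuery()

    val cols = ArrayBuffer[String]()
    while (rs.next()) {
      cols += s"${rs.getString("column_name")} ${rs.getString("data_type")}"
    }
    conn.close()

    // Naive DDL -- real code would map types and handle precision, nullability, etc.
    val ddl = cols.mkString("CREATE TABLE destination_table (", ", ", ")")

    // Then execute `ddl` against the destination, again with plain JDBC.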

Re: Datasource API V2 and checkpointing

2018-05-01 Thread Thakrar, Jayesh
From: Joseph Torres <joseph.tor...@databricks.com> Sent: Tuesday, May 1, 2018 1:58:54 PM To: Ryan Blue Cc: Thakrar, Jayesh; dev@spark.apache.org Subject: Re: Datasource API V2 and checkpointing I agree that Spark should fully handle state serialization and re

Re: Datasource API V2 and checkpointing

2018-04-27 Thread Thakrar, Jayesh
Thanks Joseph! From: Joseph Torres <joseph.tor...@databricks.com> Date: Friday, April 27, 2018 at 11:23 AM To: "Thakrar, Jayesh" <jthak...@conversantmedia.com> Cc: "dev@spark.apache.org" <dev@spark.apache.org> Subject: Re: Datasource API V2 and c

Re: Datasource API V2 and checkpointing

2018-04-27 Thread Thakrar, Jayesh
Wondering if this issue is related to SPARK-23323? Any pointers will be greatly appreciated…. Thanks, Jayesh From: "Thakrar, Jayesh" <jthak...@conversantmedia.com> Date: Monday, April 23, 2018 at 9:49 PM To: "dev@spark.apache.org" <dev@spark.apache.or

Datasource API V2 and checkpointing

2018-04-23 Thread Thakrar, Jayesh
I was wondering: when checkpointing is enabled, who does the actual work – the streaming datasource or the execution engine/driver? I have written a small/trivial datasource that just generates strings. After enabling checkpointing, I do see a folder being created under the checkpoint folder, but

Re: V2.3 Scala API to Github Links Incorrect

2018-04-15 Thread Thakrar, Jayesh
Thanks Sameer! From: Sameer Agarwal <samee...@apache.org> Date: Sunday, April 15, 2018 at 10:02 PM To: "Thakrar, Jayesh" <jthak...@conversantmedia.com> Cc: "dev@spark.apache.org" <dev@spark.apache.org>, Hyukjin Kwon <gurwls...@gmail.com> Subject:

V2.3 Scala API to Github Links Incorrect

2018-04-15 Thread Thakrar, Jayesh
In browsing through the API docs, the links to the GitHub source code seem to be pointing to a dev branch rather than the release branch. Here's one example: go to the API doc page below and click on the "ProcessingTime.scala" link, which points to Sameer's dev branch.

Spark 2.3 V2 Datasource API questions

2018-04-06 Thread Thakrar, Jayesh
First of all, thank you to the Spark dev team for coming up with the standardized and intuitive API interfaces. I am sure it will encourage a lot more new datasource integrations. I have been playing with the API and have some questions on the continuous streaming API (see

Re: Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Thakrar, Jayesh
t; to true. But I should confess that I don't know the source code very well, so I would appreciate it if you can point me to any other pointers/examples. From: Wenchen Fan <cloud0...@gmail.com> Date: Thursday, March 22, 2018 at 2:52 PM To: "Thakrar, Jayesh" <jthak...

Re: Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Thakrar, Jayesh
(0).asInstanceOf[String]))) sqlContext.internalCreateDataFrame(internalRow, schema, isStreaming = true) } } From: Ryan Blue <rb...@netflix.com> Reply-To: "rb...@netflix.com" <rb...@netflix.com> Date: Thursday, March 22, 2018 at 1:45 PM To: "Thakrar, Jayesh" <jthak...@conversantm

Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Thakrar, Jayesh
Because these are not exposed in the usual API, it is not possible (or at least difficult) to create custom structured streaming sources. Consequently, one has to create streaming sources in packages under org.apache.spark.sql. Any pointers or info would be greatly appreciated.
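
For illustration, the workaround mentioned amounts to compiling the source under the org.apache.spark.sql package so the package-private method is visible. A rough sketch, assuming internalCreateDataFrame takes an RDD of InternalRow, a schema and the isStreaming flag as in the call quoted in the reply above (object and method names here are made up):

    package org.apache.spark.sql

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.types.StructType

    // Visible to private[sql] members only because this file lives in the
    // org.apache.spark.sql package.
    object StreamingSourceHelper {
      def toStreamingDataFrame(sqlContext: SQLContext,
                               rows: RDD[InternalRow],
                               schema: StructType): DataFrame =
        sqlContext.internalCreateDataFrame(rows, schema, isStreaming = true)
    }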

Re: "Spark.jars not adding jars to classpath"

2018-03-22 Thread Thakrar, Jayesh
Is this in spark-shell or a spark-submit job? If it is a spark-submit job, is it running in local or cluster mode? One reliable way of adding jars is to use the command-line option "--jars". See http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management for more info. If you add
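
For reference, the command-line form being suggested (class, master and jar paths below are placeholders):

    spark-submit \
      --class com.example.MyApp \
      --master yarn \
      --deploy-mode cluster \
      --jars /path/to/dep1.jar,/path/to/dep2.jar \
      my-app.jar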

Cannot create custom streaming source in Spark 2.3.0

2018-03-19 Thread Thakrar, Jayesh
I am trying to create a custom streaming source in Spark 2.3.0 and getting the following error: scala> 2018-03-19 17:43:20 ERROR MicroBatchExecution:91 - Query [id = 48bb7a4c-7c66-4ad3-926b-81f8369a6efb, runId = 50800f9b-434d-43df-8d6a-3e0fdc865aeb] terminated with error

Re: SparkContext - parameter for RDD, but not serializable, why?

2018-02-28 Thread Thakrar, Jayesh
myDataSourcePartition.rowCount val partitionData = 1 to rows map(r => Row(s"Partition: ${partitionId}, row ${r} of ${rows}")) partitionData.iterator } } From: Wenchen Fan <cloud0...@gmail.com> Date: Wednesday, February 28, 2018 at 12:25 PM To: "Th

SparkContext - parameter for RDD, but not serializable, why?

2018-02-28 Thread Thakrar, Jayesh
Hi All, I was just toying with creating a very rudimentary RDD datasource to understand the inner workings of RDDs. It seems that one of the constructors for RDD has a parameter of type SparkContext, but it (apparently) exists on the driver only and is not serializable. Consequently, any
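
A minimal sketch of the kind of RDD being described (names and row contents are illustrative): the SparkContext is passed to the RDD superclass constructor, which keeps it as a transient, driver-only field, so compute() must not reference it.

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Illustrative only: a trivial RDD that generates strings.
    class ToyRDD(sc: SparkContext, numPartitions: Int, rowsPerPartition: Int)
      extends RDD[String](sc, Nil) {   // sc is held @transient by the RDD base class

      override protected def getPartitions: Array[Partition] =
        (0 until numPartitions).map { i =>
          new Partition { override def index: Int = i }
        }.toArray

      // Runs on executors; uses only serializable state, never the SparkContext.
      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        (1 to rowsPerPartition).iterator
          .map(r => s"Partition ${split.index}, row $r of $rowsPerPartition")
    }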

Spark Streaming Custom Receiver Anomaly

2018-02-20 Thread Thakrar, Jayesh
Hi All, I am trying to "test" a very simple custom receiver and am a little puzzled. Using the Spark 2.2.0 shell on my laptop, I am running the code below. I was expecting the code to time out since my timeout wait period is 1 ms and I have a sleep in the class that is much longer (1200 ms). Is this
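
The actual code is cut off above; purely as an assumption-based reconstruction of the scenario described (class body and timings are guesses, and `sc` is the shell's SparkContext), it would look roughly like this:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.receiver.Receiver

    // Illustrative receiver that sleeps longer (1200 ms) than the await timeout (1 ms).
    class SlowReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      override def onStart(): Unit = {
        new Thread("slow-receiver") {
          override def run(): Unit = {
            Thread.sleep(1200)
            store("hello")
          }
        }.start()
      }
      override def onStop(): Unit = {}
    }

    val ssc = new StreamingContext(sc, Seconds(1))
    ssc.receiverStream(new SlowReceiver).print()
    ssc.start()
    // Returns false if the context is still running when the timeout expires.
    ssc.awaitTerminationOrTimeout(1)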

Re: Expand the Spark SQL programming guide?

2016-12-16 Thread Thakrar, Jayesh
ed Datasets. Jim, what if I submit a PR and you join the review process? I also do not mind splitting this if you want, but it seems to be overkill for this part. Jayesh, shall I skip the window functions part since you are going to work on that? 2016-12-15 22:48 GMT+01:00 Thakrar, Jayesh &l

Re: Can I add a new method to RDD class?

2016-12-05 Thread Thakrar, Jayesh
Teng, before you go down the path of creating your own custom Spark system, do give some thought to what Holden and others are suggesting, viz. using implicit methods. If you want real, concrete examples, have a look at the Spark Cassandra Connector - here you will see an example of "extending"
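
A small sketch of the implicit-method approach being suggested (the extension method name is made up):

    import org.apache.spark.rdd.RDD

    object RDDExtensions {
      // Hypothetical extension: adds a countNonEmpty() method to any RDD[String].
      implicit class RichStringRDD(val rdd: RDD[String]) extends AnyVal {
        def countNonEmpty(): Long = rdd.filter(_.nonEmpty).count()
      }
    }

    // Usage: import RDDExtensions._ and then call myRdd.countNonEmpty()
    // without modifying or recompiling Spark itself.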