Just wondering if this is what you are implying, Ryan (example only):
val data = (dataset to be partitioned)
val splitCondition =
  s"""
    CASE
      WHEN …. THEN ….
      WHEN …. THEN ….
    END partition_condition
  """
val partitionedData = data.withColumn("partitionColumn", expr(splitCondition))
Thank you, Ryan and Xiao – sharing all this info really gives very good insight!
From: Ryan Blue
Reply-To: "rb...@netflix.com"
Date: Monday, December 3, 2018 at 12:05 PM
To: "Thakrar, Jayesh"
Cc: Xiao Li , Spark Dev List
Subject: Re: DataSourceV2 community sync #3
Ja
To: "Thakrar, Jayesh"
Cc: Ryan Blue , "u...@spark.apache.org"
Subject: Re: DataSourceV2 community sync #3
Hi, Jayesh,
This is a good question. Spark is a unified analytics engine for various data
sources. We are able to get the table schema from the underlying data sources
Just curious on the need for a catalog within Spark.
So Spark interfaces with other systems – many of which have a catalog of their own (e.g. RDBMSes, HBase, Cassandra, etc.) and some of which don’t (e.g. HDFS, plain filesystems, etc.).
So what is the purpose of having this catalog within Spark for tables
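(To make the distinction concrete, a quick illustration with made-up paths and table names: a path-based read needs no catalog because the schema comes from the files themselves, while a name-based read has to be resolved through some catalog.)
// Path-based read: no catalog involved; Spark infers the schema from the files.
val fromFiles = spark.read.orc("hdfs:///warehouse/events/")
// Name-based read: "db.events" must be resolved through a catalog
// (today the session/Hive catalog; with DataSourceV2, potentially a pluggable one).
val fromCatalog = spark.table("db.events")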
Thank you for the quick reply Dongjoon.
This sounds interesting and it might be the resolution for our issue.
Let me do some tests and I will update the thread.
Thanks,
Jayesh
From: Dongjoon Hyun
Date: Wednesday, November 21, 2018 at 11:46 AM
To: "Thakrar, Jayesh"
Cc: dev
Subject:
Hi All,
We have some batch processing where we read 100s of thousands of ORC files.
What I found is that this was taking too much time AND that there was a long pause between the point where the read begins in the code and when the executors get into action.
That period is about 1.5+ hours where only the
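(For what it's worth, one common mitigation for that long driver-side pause – not necessarily what was suggested in the reply above – is to supply the schema up front so the driver does not have to infer it from the files; listing that many files is the other large cost. Field names, types and the path below are placeholders.)
import org.apache.spark.sql.types._

val eventSchema = StructType(Seq(
  StructField("event_id", LongType),
  StructField("event_ts", TimestampType),
  StructField("payload", StringType)))

val events = spark.read
  .schema(eventSchema)              // skip driver-side schema inference
  .orc("hdfs:///data/events/")      // placeholder path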
September 19, 2018 at 4:35 PM
To: "Thakrar, Jayesh"
Cc: "tigerqu...@outlook.com" , Spark Dev List
Subject: Re: [Discuss] Datasource v2 support for manipulating partitions
What does partition management look like in those systems and what are the
options we would standard
by the platform.
Similar to the static/dynamic partition loading in Hive and Oracle.
So in short, I agree that partition management should be an optional interface.
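(For reference, this is roughly what static vs. dynamic partition loading looks like when expressed through Spark SQL against a Hive-style partitioned table; table and column names are made up.)
// Static partition load: the target partition is named explicitly.
spark.sql(
  "INSERT OVERWRITE TABLE sales PARTITION (dt = '2018-09-19') " +
  "SELECT id, amount FROM staging WHERE dt = '2018-09-19'")

// Dynamic partition load: partition values are taken from the data itself.
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql(
  "INSERT OVERWRITE TABLE sales PARTITION (dt) " +
  "SELECT id, amount, dt FROM staging")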
From: Ryan Blue
Reply-To: "rb...@netflix.com"
Date: Wednesday, September 19, 2018 at 2:58 PM
To: "Thakrar, Jayesh
Thanks for the info Ryan – very helpful!
From: Ryan Blue
Reply-To: "rb...@netflix.com"
Date: Wednesday, September 19, 2018 at 3:17 PM
To: "Thakrar, Jayesh"
Cc: Wenchen Fan , Hyukjin Kwon ,
Spark Dev List
Subject: Re: data source api v2 refactoring
Hi Jayesh,
The exis
Totally agree with you, Dale, that there are situations where, for efficiency, performance and better control/visibility/manageability, we need to expose partition management.
So as described, I suggested two things - the ability to do it in the current
V2 API form via options and appropriate
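(A very rough sketch of the "via options" idea – the format and option names below are entirely hypothetical and only meant to show the shape of it, assuming df is the DataFrame being written and the connector interprets the options.)
df.write
  .format("com.example.partitionedstore")     // hypothetical V2 connector
  .option("partition.create", "true")         // hypothetical option
  .option("partition.spec", "dt=2018-09-19")  // hypothetical option
  .mode("append")
  .save()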
I am not involved with the design or development of the V2 API - so these could
be naïve comments/thoughts.
Just as Dataset abstracts away from RDD, which otherwise required somewhat more intimate knowledge of Spark internals, I am guessing the absence of partition operations is either
connector may not be good and flexible
enough.
From: Russell Spitzer
Date: Tuesday, September 11, 2018 at 9:58 AM
To: "Thakrar, Jayesh"
Cc: Arun Mahadevan , Jungtaek Lim ,
Wenchen Fan , Reynold Xin , Ross
Lawley , Ryan Blue , dev
, "dbis...@us.ibm.com"
Subject: Re: Dat
So if Spark and the destination datastore are both non-transactional, you will
have to resort to an external mechanism for “transactionality”.
Here are some options for both RDBMS and non-transaction datastore destination.
For now assuming that Spark is used in batch mode (and not streaming
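(One concrete shape that "external mechanism" can take for an RDBMS destination is a staging-table swap; this is only a sketch, with connection details and table names made up.)
// 1) Write the batch to a staging table; a failed job leaves the real table untouched.
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://dest-host/db")   // placeholder connection
  .option("dbtable", "sales_staging")
  .mode("overwrite")
  .save()

// 2) Only after the Spark job succeeds, promote the staging table in one
//    transaction executed outside Spark (e.g. via plain JDBC):
//    BEGIN;
//      DROP TABLE IF EXISTS sales_old;
//      ALTER TABLE sales RENAME TO sales_old;
//      ALTER TABLE sales_staging RENAME TO sales;
//    COMMIT;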
Ryan et al,
Wondering if the existing Spark-based data sources (e.g. for HDFS, Kafka) have been ported to V2.
I remember reading threads where there were discussions about the inefficiency/overhead of converting from Row to InternalRow that was preventing certain porting efforts, etc.
I ask
data. I have lots of tables to transport from one environment to another and I don’t want to create unnecessary load on the DB.
On 7/12/18, 10:09 AM, "Thakrar, Jayesh"
wrote:
One option is to use plain JDBC to interrogate the PostgreSQL catalog for the source tab
One option is to use plain JDBC to interrogate the PostgreSQL catalog for the source table and generate the DDL to create the destination table.
Then using plain JDBC again, create the table at the destination.
See the link below for some pointers…..
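(A bare-bones sketch of that approach; connection details and the table name are placeholders.)
import java.sql.DriverManager
import scala.collection.mutable.ListBuffer

// Read column metadata from the PostgreSQL catalog and build a CREATE TABLE
// statement that can then be executed against the destination over plain JDBC.
val conn = DriverManager.getConnection(
  "jdbc:postgresql://src-host/db", "user", "secret")          // placeholders
val rs = conn.createStatement().executeQuery(
  "SELECT column_name, data_type FROM information_schema.columns " +
  "WHERE table_name = 'source_table' ORDER BY ordinal_position")

val cols = ListBuffer[String]()
while (rs.next()) {
  cols += s"${rs.getString("column_name")} ${rs.getString("data_type")}"
}
conn.close()

val ddl = cols.mkString("CREATE TABLE source_table (", ", ", ")")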
From: Joseph Torres <joseph.tor...@databricks.com>
Sent: Tuesday, May 1, 2018 1:58:54 PM
To: Ryan Blue
Cc: Thakrar, Jayesh; dev@spark.apache.org
Subject: Re: Datasource API V2 and checkpointing
I agree that Spark should fully handle state serialization and re
Thanks Joseph!
From: Joseph Torres <joseph.tor...@databricks.com>
Date: Friday, April 27, 2018 at 11:23 AM
To: "Thakrar, Jayesh" <jthak...@conversantmedia.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: Datasource API V2 and c
Wondering if this issue is related to SPARK-23323?
Any pointers will be greatly appreciated….
Thanks,
Jayesh
From: "Thakrar, Jayesh" <jthak...@conversantmedia.com>
Date: Monday, April 23, 2018 at 9:49 PM
To: "dev@spark.apache.org" <dev@spark.apache.or
I was wondering: when checkpointing is enabled, who does the actual work – the streaming datasource or the execution engine/driver?
I have written a small/trivial datasource that just generates strings.
After enabling checkpointing, I do see a folder being created under the
checkpoint folder, but
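(For context, this is the extent of what the application itself specifies – just a checkpointLocation; the offset/commit logs under that folder are written by the engine. The "rate" source below stands in for the custom string-generating source, and the path is a placeholder.)
val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/string-source-checkpoint")
  .start()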
Thanks Sameer!
From: Sameer Agarwal <samee...@apache.org>
Date: Sunday, April 15, 2018 at 10:02 PM
To: "Thakrar, Jayesh" <jthak...@conversantmedia.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>, Hyukjin Kwon
<gurwls...@gmail.com>
Subject:
In browsing through the API docs, the links to GitHub source code seem to be pointing to a dev branch rather than the release branch.
Here's one example
Go to the API doc page below and click on the "ProcessingTime.scala" link which
points to Sameer's dev branch.
First of all, thank you to the Spark dev team for coming up with the standardized and intuitive API interfaces.
I am sure it will encourage a lot more new datasource integrations.
I have been playing with the API and have some questions on the continuous streaming API
(see
t; to true.
But I should confess that I don't know the source code very well, so I would appreciate it if you can point me to any other pointers/examples.
From: Wenchen Fan <cloud0...@gmail.com>
Date: Thursday, March 22, 2018 at 2:52 PM
To: "Thakrar, Jayesh" <jthak...
(0).asInstanceOf[String])))
sqlContext.internalCreateDataFrame(internalRow, schema, isStreaming = true)
}
}
From: Ryan Blue <rb...@netflix.com>
Reply-To: "rb...@netflix.com" <rb...@netflix.com>
Date: Thursday, March 22, 2018 at 1:45 PM
To: "Thakrar, Jayesh" <jthak...@conversantm
Because these are not exposed in the usual API, it's not possible (or at least difficult) to create custom structured streaming sources.
Consequently, one has to create streaming sources in packages under
org.apache.spark.sql.
Any pointers or info is greatly appreciated.
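(A sketch of that workaround, with illustrative names; the only essential part is that the file lives in a sub-package of org.apache.spark.sql so that private[sql] members such as internalCreateDataFrame become reachable.)
package org.apache.spark.sql.sources.custom

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

// Lives under org.apache.spark.sql, so the private[sql] helper is accessible.
object InternalRowBridge {
  def toStreamingDF(sqlContext: SQLContext,
                    rows: RDD[InternalRow],
                    schema: StructType): DataFrame =
    sqlContext.internalCreateDataFrame(rows, schema, isStreaming = true)
}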
Is this in spark-shell or a spark-submit job?
If spark-submit job, is it local or cluster?
One reliable way of adding jars is to use the command line option "--jars"
See
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
for more info.
If you add
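For example (with made-up paths and class name): spark-submit --class com.example.MyApp --jars /path/to/dep1.jar,/path/to/dep2.jar my-app.jar, which ships the listed jars to both the driver and the executors.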
I am trying to create a custom streaming source in Spark 2.3.0 and getting the
following error:
scala> 2018-03-19 17:43:20 ERROR MicroBatchExecution:91 - Query [id =
48bb7a4c-7c66-4ad3-926b-81f8369a6efb, runId =
50800f9b-434d-43df-8d6a-3e0fdc865aeb] terminated with error
    val rows = myDataSourcePartition.rowCount
    val partitionData = 1 to rows map (r => Row(s"Partition: ${partitionId}, row ${r} of ${rows}"))
    partitionData.iterator
  }
}
From: Wenchen Fan <cloud0...@gmail.com>
Date: Wednesday, February 28, 2018 at 12:25 PM
To: "Th
Hi All,
I was just toying with creating a very rudimentary RDD datasource to understand
the inner workings of RDDs.
It seems that one of the constructors for RDD has a parameter of type
SparkContext, but it (apparently) exists on the driver only and is not
serializable.
Consequently, any
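(A minimal sketch of the kind of custom RDD being described, with illustrative names. The SparkContext is handed straight to the RDD superclass constructor, where it is held as a transient field, and is never referenced on the executor side.)
import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, SparkContext, TaskContext}

// Toy partition kept as its own small serializable class.
case class StringsPartition(index: Int) extends Partition

// Toy RDD: each partition yields a single string.
class StringsRDD(sc: SparkContext, numPartitions: Int) extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    (0 until numPartitions).map(i => StringsPartition(i): Partition).toArray

  // Runs on executors; uses only the Partition, never the SparkContext.
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator(s"partition ${split.index}")
}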
Hi All,
I am trying to "test" a very simple custom receiver and am a little puzzled.
Using Spark 2.2.0 shell on my laptop, I am running the code below.
I was expecting the code to time out since my timeout wait period is 1 ms and I have a sleep in the class that is much longer (1200 ms).
Is this
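(Roughly what is being described, reconstructed with illustrative names; this is not the original code from the mail, just the shape of the test.)
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative receiver that sleeps far longer than the awaitTermination timeout.
class SleepyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  override def onStart(): Unit = {
    Thread.sleep(1200)      // the "much longer" sleep mentioned above
    store("hello")
  }
  override def onStop(): Unit = ()
}

val ssc = new StreamingContext(sc, Seconds(1))   // `sc` as provided by spark-shell
ssc.receiverStream(new SleepyReceiver).print()
ssc.start()
// Expectation in the mail: with a 1 ms timeout this should return almost
// immediately, with false because the stream has not actually terminated.
val terminated = ssc.awaitTerminationOrTimeout(1)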
ed Datasets.
Jim, what if I submit a PR and you join the review process? I also don't mind splitting this if you want, but it seems to be overkill for this part.
Jayesh, shall I skip the window functions part since you are going to work on
that?
2016-12-15 22:48 GMT+01:00 Thakrar, Jayesh
Teng,
Before you go down the path of creating your own custom Spark system, do give some thought to what Holden and others are suggesting, viz. using implicit methods.
If you want real concrete examples, have a look at the Spark Cassandra
Connector -
Here you will see an example of "extending"
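(For instance, the enrichment pattern being suggested looks roughly like this – names are made up; the Spark Cassandra Connector does something similar to add methods such as cassandraTable to SparkContext.)
import org.apache.spark.sql.{DataFrame, SparkSession}

object MyStoreExtensions {
  // "Attaches" a convenience method to SparkSession without modifying Spark itself.
  implicit class RichSparkSession(val spark: SparkSession) extends AnyVal {
    def readMyStore(table: String): DataFrame =
      spark.read.format("com.example.mystore").option("table", table).load()   // hypothetical connector
  }
}

// Usage: import MyStoreExtensions._ and then spark.readMyStore("events")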