Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Gavin Ray
Wow, really neat -- thanks for sharing!

On Mon, Jul 3, 2023 at 8:12 PM Gengliang Wang  wrote:

> Dear Apache Spark community,
>
> We are delighted to announce the launch of a groundbreaking tool that aims
> to make Apache Spark more user-friendly and accessible - the English SDK.
> Powered by Generative AI, the English SDK allows you to execute complex
> tasks with simple English instructions. This exciting news was announced
> recently at the Data+AI Summit and also introduced through a detailed blog
> post.
>
> Now, we need your invaluable feedback and contributions. The aim of the
> English SDK is not only to simplify and enrich your Apache Spark experience
> but also to grow with the community. We're calling upon Spark developers
> and users to explore this innovative tool, offer your insights, provide
> feedback, and contribute to its evolution.
>
> You can find more details about the SDK and usage examples on the GitHub
> repository https://github.com/databrickslabs/pyspark-ai/. If you have any
> feedback or suggestions, please feel free to open an issue directly on the
> repository. We are actively monitoring the issues and value your insights.
>
> We also welcome pull requests and are eager to see how you might extend or
> refine this tool. Let's come together to continue making Apache Spark more
> approachable and user-friendly.
>
> Thank you in advance for your attention and involvement. We look forward
> to hearing your thoughts and seeing your contributions!
>
> Best,
> Gengliang Wang
>


Re: Complexity with the data

2022-05-25 Thread Gavin Ray
Forgot to reply-all last message, whoops. Not very good at email.

You need to normalize the CSV with a parser that can escape commas inside of
strings. I'm not sure if Spark has an option for this?
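
That said, Spark's CSV reader does have quote/escape/multiLine options; here's
a minimal sketch of what I'd try first (assuming the troublesome fields are
actually quoted; if the embedded commas/newlines sit in unquoted fields, you'll
probably still need to pre-process the file outside Spark):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("csv-check")        // hypothetical app name, just for the sketch
    .master("local[*]")
    .getOrCreate()

  val df = spark.read
    .option("header", "true")
    .option("multiLine", "true") // let quoted fields span multiple lines
    .option("quote", "\"")
    .option("escape", "\"")      // a doubled quote inside a quoted field escapes it
    .csv("path")                 // "path" is a placeholder

  df.show(truncate = false)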


On Wed, May 25, 2022 at 4:37 PM Sid  wrote:

> Thank you so much for your time.
>
> I have data like below, which I tried to load by setting multiple options
> while reading the file; however, I am not able to consolidate the 9th
> column data within itself.
>
> [image: image.png]
>
> I tried the below code:
>
> df = spark.read.option("header", "true") \
>     .option("multiline", "true") \
>     .option("inferSchema", "true") \
>     .option("quote", '"') \
>     .option("delimiter", ",") \
>     .csv("path")
>
> What else can I do?
>
> Thanks,
> Sid
>
>
> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
> papad...@csd.auth.gr> wrote:
>
>> Dear Sid,
>>
>> can you please give us more info? Is it true that every line may have a
>> different number of columns? Is there any rule followed by every line of
>> the file? From the information you have sent I cannot fully understand
>> the "schema" of your data.
>>
>> Regards,
>>
>> Apostolos
>>
>>
>> On 25/5/22 23:06, Sid wrote:
>> > Hi Experts,
>> >
>> > I have the below CSV data, which is getting generated automatically. I
>> > can't change the data manually.
>> >
>> > The data looks like below:
>> >
>> > 2020-12-12,abc,2000,,INR,
>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>> >
>> > Since I don't have much experience with the other domains.
>> >
>> > It is handled by the other people.,INR
>> > 2020-12-12,abc,2000,,USD,
>> >
>> > The third record is a problem, since the value was split across new
>> > lines by the user while filling up the form. So, how do I handle this?
>> >
>> > There are 6 columns and 4 records in total. These are the sample
>> > records.
>> >
>> > Should I load it as an RDD and then maybe use a regex to eliminate the
>> > new lines? Or how should it be done? With ". \n"?
>> >
>> > Any suggestions?
>> >
>> > Thanks,
>> > Sid
>>
>> --
>> Apostolos N. Papadopoulos, Associate Professor
>> Department of Informatics
>> Aristotle University of Thessaloniki
>> Thessaloniki, GREECE
>> tel: ++0030312310991918
>> email: papad...@csd.auth.gr
>> twitter: @papadopoulos_ap
>> web: http://datalab.csd.auth.gr/~apostol


Re: [Spark SQL]: Configuring/Using Spark + Catalyst optimally for read-heavy transactional workloads in JDBC sources?

2022-05-18 Thread Gavin Ray
Following up on this in case anyone runs across it in the archives in the
future.
From reading through the config docs and trying various combinations, I've
discovered that:

- You don't want to disable codegen. Disabling it roughly doubled the time to
  perform simple, few-column/few-row queries in basic testing.
  - You can test this by setting the internal codegen properties after setting
    "spark.testing" to "true" in system properties, as below:


> System.setProperty("spark.testing", "true")
> val spark = SparkSession.builder()
>   .config("spark.sql.codegen.wholeStage", "false")
>   .config("spark.sql.codegen.factoryMode", "NO_CODEGEN")
>   .getOrCreate()

- The following gave the best performance. I don't know if enabling CBO did
  much.

> val spark = SparkSession.builder()
>   .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .config("spark.kryo.unsafe", "true")
>   .config("spark.sql.adaptive.enabled", "true")
>   .config("spark.sql.cbo.enabled", "true")
>   .config("spark.sql.cbo.joinReorder.dp.star.filter", "true")
>   .config("spark.sql.cbo.joinReorder.enabled", "true")
>   .config("spark.sql.cbo.planStats.enabled", "true")
>   .config("spark.sql.cbo.starSchemaDetection", "true")
>   .getOrCreate()


If you're running on a more recent JDK, you'll need to set "--add-opens"
flags for a few java.base packages for "kryo.unsafe" to work.
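
For example, JVM options along these lines (treat the exact list as an
assumption; it may differ by Spark/JDK version):

  --add-opens=java.base/java.lang=ALL-UNNAMED
  --add-opens=java.base/java.nio=ALL-UNNAMED
  --add-opens=java.base/sun.nio.ch=ALL-UNNAMED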



On Mon, May 16, 2022 at 12:55 PM Gavin Ray  wrote:

> Hi all,
>
> I've not got much experience with Spark, but have been reading the
> Catalyst and
> Datasources V2 code/tests to try to get a basic understanding.
>
> I'm interested in trying Catalyst's query planner + optimizer for queries
> spanning one-or-more JDBC sources.
>
> Somewhat unusually, I'd like to do this with as little latency as possible
> to see what the experience for standard line-of-business apps is like (~90/10
> read/write ratio).
> Few rows would be returned in the reads (something on the order of
> 1-to-1,000).
>
> My question is: What configuration settings would you want to use for
> something like this?
>
> I imagine that doing codegen/JIT compilation of the query plan might not be
> worth the cost, so maybe you'd want to disable that and do interpretation?
>
> And possibly you'd want to use query plan config/rules that reduce the time
> spent in planning, trading efficiency for latency?
>
> Does anyone know how you'd configure Spark to test something like this?
>
> Would greatly appreciate any input (even if it's "This is a bad idea and
> will never work well").
>
> Thank you =)
>


[SQL] Why does a small two-source JDBC query take ~150-200ms with all optimizations (AQE, CBO, pushdown, Kryo, unsafe) enabled? (v3.4.0-SNAPSHOT)

2022-05-18 Thread Gavin Ray
I did some basic testing of multi-source queries with the most recent Spark:
https://github.com/GavinRay97/spark-playground/blob/44a756acaee676a9b0c128466e4ab231a7df8d46/src/main/scala/Application.scala#L46-L115

The output of "spark.time()" surprised me:

SELECT p.id, p.name, t.id, t.title
FROM db1.public.person p
JOIN db2.public.todos t
ON p.id = t.person_id
WHERE p.id = 1

+---++---+--+
| id|name| id| title|
+---++---+--+
|  1| Bob|  1|Todo 1|
|  1| Bob|  2|Todo 2|
+---++---+--+
Time taken: 168 ms

SELECT p.id, p.name, t.id, t.title
FROM db1.public.person p
JOIN db2.public.todos t
ON p.id = t.person_id
WHERE p.id = 2
LIMIT 1

+---+-+---+--+
| id| name| id| title|
+---+-+---+--+
|  2|Alice|  3|Todo 3|
+---+-+---+--+
Time taken: 228 ms


Calcite and Teiid manage to do this on the order of 5-50ms for basic queries,
so I'm curious about the technical specifics of why Spark appears to be so
much slower here?
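
In case anyone wants to dig in, here's a rough sketch of how I'd split the
measurement into planning vs. execution time (my own guess at a useful
breakdown; "query" is a placeholder for one of the SQL strings above, and
"spark" is the session from the linked playground code):

  val df = spark.sql(query)                   // parse + analyze
  spark.time(df.queryExecution.executedPlan)  // optimize + build the physical plan
  spark.time(df.collect())                    // run the JDBC scans and the join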


[Spark SQL]: Configuring/Using Spark + Catalyst optimally for read-heavy transactional workloads in JDBC sources?

2022-05-16 Thread Gavin Ray
Hi all,

I've not got much experience with Spark, but have been reading the Catalyst and
Datasources V2 code/tests to try to get a basic understanding.

I'm interested in trying Catalyst's query planner + optimizer for queries
spanning one-or-more JDBC sources.

Somewhat unusually, I'd like to do this with as little latency as possible to
see what the experience for standard line-of-business apps is like (~90/10
read/write ratio).
Few rows would be returned in the reads (something on the order of
1-to-1,000).

My question is: What configuration settings would you want to use for
something like this?

I imagine that doing codegen/JIT compilation of the query plan might not be
worth the cost, so maybe you'd want to disable that and do interpretation?

And possibly you'd want to use query plan config/rules that reduce the time
spent in planning, trading efficiency for latency?

Does anyone know how you'd configure Spark to test something like this?

Would greatly appreciate any input (even if it's "This is a bad idea and will
never work well").

Thank you =)