[PySpark] [applyInPandas] Regression Bug: Cogroup in pandas drops columns from the first dataframe

2022-11-25 Thread Michael Bílý
) # 10.0.1 It works on AWS Glue session with these versions: [image: image.png] It prints:
+----------+-----+
|day       |value|
+----------+-----+
|2017-08-17|    1|
+----------+-----+
as expected. Thank you, Michael

Issues getting Apache Spark

2022-05-26 Thread Martin, Michael
to work on my laptop. Michael Martin

RE: Spark Kafka Integration

2022-02-25 Thread Michael Williams (SSI)
Ahh, ok. So, Kafka 3.1 is supported for Spark 3.2.1. Thank you very much. From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Friday, February 25, 2022 2:50 PM To: Michael Williams (SSI) Cc: user@spark.apache.org Subject: Re: Spark Kafka Integration these are the old and news ones

RE: Spark Kafka Integration

2022-02-25 Thread Michael Williams (SSI)
Thank you, that is good to know. From: Sean Owen [mailto:sro...@gmail.com] Sent: Friday, February 25, 2022 2:46 PM To: Michael Williams (SSI) Cc: Mich Talebzadeh ; user@spark.apache.org Subject: Re: Spark Kafka Integration Spark 3.2.1 is compiled vs Kafka 2.8.0; the forthcoming Spark 3.3

RE: Spark Kafka Integration

2022-02-25 Thread Michael Williams (SSI)
Thank you From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Friday, February 25, 2022 2:35 PM To: Michael Williams (SSI) Cc: Sean Owen ; user@spark.apache.org Subject: Re: Spark Kafka Integration please see my earlier reply for 3.1.1 tested and worked in Google Dataproc

RE: Spark Kafka Integration

2022-02-25 Thread Michael Williams (SSI)
To: Michael Williams (SSI) Cc: user@spark.apache.org Subject: Re: Spark Kafka Integration and what version of kafka do you have 2.7? for spark 3.1.1 I needed these jar files to make it work kafka-clients-2.7.0.jar commons-pool2-2.9.0.jar spark-streaming_2.12-3.1.1.jar spark-sql-kafka-0-10_2.12

RE: Spark Kafka Integration

2022-02-25 Thread Michael Williams (SSI)
exist on disk. If that makes any sense. Thank you From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Friday, February 25, 2022 2:16 PM To: Michael Williams (SSI) Cc: user@spark.apache.org Subject: Re: Spark Kafka Integration What is the use case? Is this for spark structured

Spark Kafka Integration

2022-02-25 Thread Michael Williams (SSI)
After reviewing Spark's Kafka Integration guide, it indicates that spark-sql-kafka-0-10_2.12_3.2.1.jar and its dependencies are needed for Spark 3.2.1 (+ Scala 2.12) to work with Kafka. Can anybody clarify the cleanest, most repeatable (reliable) way to acquire these jars for including in a

RE: Consuming from Kafka to delta table - stream or batch mode?

2022-02-24 Thread Michael Williams (SSI)
Thank you. From: Peyman Mohajerian [mailto:mohaj...@gmail.com] Sent: Thursday, February 24, 2022 9:00 AM To: Michael Williams (SSI) Cc: user@spark.apache.org Subject: Re: Consuming from Kafka to delta table - stream or batch mode? If you want to batch consume from Kafka, trigger-once config

Consuming from Kafka to delta table - stream or batch mode?

2022-02-24 Thread Michael Williams (SSI)
Hello, Our team is working with Spark (for the first time) and one of the sources we need to consume is Kafka (multiple topics). Are there any practical or operational issues to be aware of when deciding whether to a) consume in batches until all messages are consumed then shut down the spark

RE: Logging to determine why driver fails

2022-02-21 Thread Michael Williams (SSI)
Thank you. From: Artemis User [mailto:arte...@dtechspace.com] Sent: Monday, February 21, 2022 8:23 AM To: Michael Williams (SSI) Subject: Re: Logging to determine why driver fails Spark uses Log4j for logging. There is a log4j properties template file located in the conf directory. You can

Logging to determine why driver fails

2022-02-21 Thread Michael Williams (SSI)
Hello, We have a POC using Spark 3.2.1 and none of us have any prior Spark experience. Our setup uses the native Spark REST api (http://localhost:6066/v1/submissions/create) on the master node (not Livy, not Spark Job server). While we have been successful at submitting python jobs via this

triggering spark python app using native REST api

2022-01-24 Thread Michael Williams (SSI)
Hello, I've been trying to work out how to replicate execution of a python app using spark-submit via the CLI using the native spark REST api (http://localhost:6066/v1/submissions/create) for a couple of weeks without success. The environment is docker using the latest docker for spark 3.2

Re: Handling skew in window functions

2021-04-28 Thread Michael Doo
k. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or dest

Handling skew in window functions

2021-04-27 Thread Michael Doo
with a count >= 10,000, 100 records with a count between 1,000 and 10,000, and 150,000 entries with a count less than 1,000 (usually count = 1). Thanks in advance! Michael Code: ``` import pyspark.sql.functions as f from pyspark.sql.window import Window empty_col_a_cond = ((f.col("col_A&

[Spark Streaming] support of non-timebased windows and lag function

2020-12-22 Thread Moser, Michael
re general, would be to use joins, but for this a streaming-ready, monotonically increasing and (!) concurrent uid would be needed. Thanks a lot & best, Michael

[Spark Core] Why no spark.read.delta / df.write.delta?

2020-10-05 Thread Moser, Michael
nic point of view, I could also imagine a single read/write >method, with the format as an arg and kwargs related to the different file >format options. Best, Michael

Re: Is RDD.persist honoured if multiple actions are executed in parallel

2020-09-24 Thread Michael Mior
If you want to ensure the persisted RDD has been calculated first, just run foreach with a dummy function first to force evaluation. -- Michael Mior michael.m...@gmail.com Le jeu. 24 sept. 2020 à 00:38, Arya Ketan a écrit : > > Thanks, we were able to validate the same behaviour. > >
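A minimal sketch of the trick described above, assuming an arbitrary RDD (names are illustrative, not from the thread):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ForcePersist {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("persist-demo").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 1000000).map(_ * 2)

    rdd.persist(StorageLevel.MEMORY_ONLY)
    // Dummy action: forces the RDD to be fully computed and cached
    // before any of the "real" actions run (possibly in parallel).
    rdd.foreach(_ => ())

    // These now read from the cache instead of recomputing the lineage.
    println(rdd.count())
    println(rdd.reduce(_ + _))

    spark.stop()
  }
}
```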

Re: Merging Parquet Files

2020-09-03 Thread Michael Segel
Hi, I think you’re asking the right question, however you’re making an assumption that he’s on the cloud and he never talked about the size of the file. It could be that he’s got a lot of small-ish data sets. 1GB is kinda small in relative terms. Again YMMV. Personally if you’re going

Spark 3 Release

2020-04-29 Thread Michael Edwards
Hi, We’re using Spark 2.4.5 in our project with Java 1.8. We wanted to upgrade to Java 11 but it seems we need to upgrade to Spark 3 to make this possible. I’ve tried it with v3.0.0-preview2 but wanted to ask if there was a release due in the near future. Regards, Mike -- Michael Edwards

Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

2020-04-22 Thread Michael Artz
What would you do with it once you get it into driver in a Dataset[Row]? Sent from my iPhone > On Apr 22, 2020, at 3:06 AM, maqy <454618...@qq.com> wrote: > >  > When the data is stored in the Dataset [Row] format, the memory usage is very > small. > When I use collect () to collect data to

Re: Why Apache Spark doesn't use Calcite?

2020-01-13 Thread Michael Mior
It's fairly common for adapters (Calcite's abstraction of a data source) to push down predicates. However, the API certainly looks a lot different than Catalyst's. -- Michael Mior mm...@apache.org Le lun. 13 janv. 2020 à 09:45, Jason Nerothin a écrit : > > The implementation they chose su

Re: Announcing Delta Lake 0.2.0

2019-06-21 Thread Michael Armbrust
> > Thanks for confirmation. We are using the workaround to create a separate > Hive external table STORED AS PARQUET with the exact location of Delta > table. Our use case is batch-driven and we are running VACUUM with 0 > retention after every batch is completed. Do you see any potential problem

Re: [EXT] handling skewness issues

2019-04-29 Thread Michael Mansour
There were recently some fantastic talks about this at the SparkSummit conference in San Francisco. I suggest you check out the SparkSummit YouTube channel after May 9th for a deep dive into this topic. From: rajat kumar Date: Monday, April 29, 2019 at 9:34 AM To: "user@spark.apache.org"

Re: Spark DataSets and multiple write(.) calls

2018-11-20 Thread Michael Shtelma
You can also cache the data frame on disk if it does not fit into memory. An alternative would be to write out the data frame as Parquet and then read it back; you can check whether in this case the whole pipeline works faster than with the standard cache. Best, Michael On Tue, Nov 20, 2018 at 9:14 AM Dipl

Re: Read Avro Data using Spark Streaming

2018-11-14 Thread Michael Shtelma
Hi, you can use this project in order to read Avro using Spark Structured Streaming: https://github.com/AbsaOSS/ABRiS Spark 2.4 has also built in support for Avro, so you can use from_avro function in Spark 2.4. Best, Michael On Sat, Nov 3, 2018 at 4:34 AM Divya Narayan wrote: > Hi, &g
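For the built-in route, a rough sketch of from_avro in Spark 2.4 (the schema, broker, and topic are made up for illustration; the spark-avro module has to be on the classpath):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.from_avro
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("avro-stream").getOrCreate()

// Hypothetical Avro schema of the Kafka message value, as a JSON string.
val avroSchema =
  """{"type":"record","name":"Event","fields":[
    |{"name":"id","type":"long"},
    |{"name":"payload","type":"string"}]}""".stripMargin

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
  .option("subscribe", "events")                    // placeholder topic
  .load()
  .select(from_avro(col("value"), avroSchema).as("event"))
  .select("event.*")
```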

Re: How to increase the parallelism of Spark Streaming application?

2018-11-07 Thread Michael Shtelma
If you configure too many Kafka partitions, you can run into memory issues. This will increase the memory requirements for the Spark job a lot. Best, Michael On Wed, Nov 7, 2018 at 8:28 AM JF Chen wrote: > I have a Spark Streaming application which reads data from kafka and s

Re: I want run deep neural network on Spark

2018-10-31 Thread Kunkel, Michael C.
Greetings, There are libraries for deep neural nets that can be used with spark. DL4J is one, and it’s as simple as changing a constructor and the maven dependency. BR MK Michael C. Kunkel, USMC, PhD Forschungszentrum Jülich Nuclear Physics Institute

Re: Lightweight pipeline execution for single eow

2018-09-23 Thread Michael Artz
Are you using the scheduler in fair mode instead of fifo mode? Sent from my iPhone > On Sep 22, 2018, at 12:58 AM, Jatin Puri wrote: > > Hi. > > What tactics can I apply for such a scenario. > > I have a pipeline of 10 stages. Simple text processing. I train the data with > the pipeline

Unsubscribe

2018-08-30 Thread Michael Styles

Re: Pitfalls of partitioning by host?

2018-08-27 Thread Michael Artz
Well if we think of shuffling as a necessity to perform an operation, then the problem would be that you are adding an aggregation stage to a job that is going to get shuffled anyway. Like if you need to join two datasets, then Spark will still shuffle the data, whether they are grouped by

Re: [Arrow][Dremio]

2018-05-14 Thread Michael Shtelma
performance with Spark ? Are there any benchmarks ? Best, Michael On Mon, May 14, 2018 at 6:53 AM, xmehaut <xavier.meh...@gmail.com> wrote: > Hello, > I've some question about Spark and Apache Arrow. Up to now, Arrow is only > used for sharing data between Python and Spark ex

Re: Spark 2.3.0 Structured Streaming Kafka Timestamp

2018-05-11 Thread Michael Armbrust
Hmm yeah that does look wrong. Would be great if someone opened a PR to correct the docs :) On Thu, May 10, 2018 at 5:13 PM Yuta Morisawa wrote: > The problem is solved. > The actual schema of Kafka message is different from documentation. > > >

Re: Dataframe vs dataset

2018-05-01 Thread Michael Artz
a > woman is a subset of a human. > > > > All DataFrames are DataSets. Not all Datasets are DataFrames. The “subset” > relationship doesn’t apply here. A DataFrame is a specialized type of > DataSet > > > > *From: *Michael Artz <michaelea...@gmail.com> > *

Re: [EXT] [Spark 2.x Core] .collect() size limit

2018-04-30 Thread Michael Mansour
expand on what you're trying to achieve here. -- Michael Mansour Data Scientist Symantec CASB On 4/28/18, 8:41 AM, "klrmowse" <klrmo...@gmail.com> wrote: i am currently trying to find a workaround for the Spark application i am working on so that it does not have

Re: Dataframe vs dataset

2018-04-28 Thread Michael Artz
datasets as typed df and therefore ds are enhanced df > Feel free to disagree.. > Kr > > On Sat, Apr 28, 2018, 2:24 PM Michael Artz <michaelea...@gmail.com> wrote: > >> Hi, >> >> I use Spark everyday and I have a good grip on the basics of Spark, so >> this qu

Dataframe vs dataset

2018-04-28 Thread Michael Artz
Hi, I use Spark everyday and I have a good grip on the basics of Spark, so this question isnt for myself. But this came up and I wanted to see what other Spark users would say, and I dont want to influence your answer. And SO is weird about polls. The question is "Which one do you feel is

Re: schema change for structured spark streaming using jsonl files

2018-04-25 Thread Michael Segel
Hi, This is going to sound complicated. Taken as an individual JSON document, because it’s a self-contained schema doc, it’s structured. However there isn’t a persisting schema that has to be consistent across multiple documents. So you can consider it semi-structured. If you’re parsing the

INSERT INTO TABLE_PARAMS fails during ANALYZE TABLE

2018-04-19 Thread Michael Shtelma
iled stack trace can be seen here: https://gist.github.com/mshtelma/c5ee8206200533fc1d606964dd5a30e2 Is it a known issue ? Best, Michael

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Michael Armbrust
You can calculate argmax using a struct. df.groupBy($"id").agg(max($"my_timestamp", struct($"*").as("data")).getField("data").select($"data.*") You could transcode this to SQL, it'll just be complicated nested queries. On Wed, Apr 18, 2018 at 3:40 PM, kant kodali wrote: >
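The one-liner above lost a couple of parentheses in the archive; a hedged reconstruction of the same struct-based argmax idea (assuming df has columns id and my_timestamp):
```scala
import org.apache.spark.sql.functions.{col, max, struct}

// For each id, keep the whole row that has the greatest my_timestamp.
// Structs compare field by field, so max() over a struct whose first
// field is the timestamp behaves like an argmax per group.
val latest = df
  .groupBy(col("id"))
  .agg(max(struct(col("my_timestamp"), struct(col("*")).as("data"))).as("m"))
  .select(col("m").getField("data").as("data"))
  .select("data.*")
```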

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-28 Thread Michael Shtelma
Hi, this property will be used in YARN mode only by the driver. Executors will use the properties coming from YARN for storing temporary files. Best, Michael On Wed, Mar 28, 2018 at 7:37 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: > Hi, > > > As per docum

Re: Using CBO on Spark 2.3 with analyzed hive tables

2018-03-27 Thread Michael Shtelma
Hi, the Jira Bug is here: https://issues.apache.org/jira/browse/SPARK-23799 I have also created the PR for the issue: https://github.com/apache/spark/pull/20913 With this fix, it is working for me really well. Best, Michael On Sat, Mar 24, 2018 at 12:39 AM, Takeshi Yamamuro <ling

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-26 Thread Michael Shtelma
-dirs parameter defined in yarn-site.xml. After this change, -Djava.io.tmpdir for the spark executor was set correctly, according to yarn.nodemanager.local-dirs parameter. Best, Michael On Mon, Mar 26, 2018 at 9:15 PM, Keith Chapman <keithgchap...@gmail.com> wrote: > Hi Michael,

Re: Using CBO on Spark 2.3 with analyzed hive tables

2018-03-23 Thread Michael Shtelma
if this helps. I will also try to understand, why in my case, I am getting NaN. Best, Michael On Fri, Mar 23, 2018 at 1:51 PM, Takeshi Yamamuro <linguin@gmail.com> wrote: > hi, > > What's a query to reproduce this? > It seems when casting double to BigDecimal, it th

Using CBO on Spark 2.3 with analyzed hive tables

2018-03-23 Thread Michael Shtelma
used. Has anybody already seen something like this ? Any assistance would be greatly appreciated! Best, Michael - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: select count * doesnt seem to respect update mode in Kafka Structured Streaming?

2018-03-20 Thread Michael Armbrust
Those options will not affect structured streaming. You are looking for .option("maxOffsetsPerTrigger", "1000") We are working on improving this by building a generic mechanism into the Streaming DataSource V2 so that the engine can do admission control on the amount of data returned in a
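A sketch of where that option goes, with placeholder broker and topic names:
```scala
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  // Cap the number of offsets consumed per micro-batch so a large
  // backlog is drained in bounded chunks instead of one huge batch.
  .option("maxOffsetsPerTrigger", "1000")
  .load()
```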

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Michael Shtelma
/application_1521110306769_0041/container_1521110306769_0041_01_04/tmp JVM is using the second Djava.io.tmpdir parameter and writing everything to the same directory as before. Best, Michael Sincerely, Michael Shtelma On Mon, Mar 19, 2018 at 6:38 PM, Keith Chapman <keithgchap...@gmail.com> wrote: > Ca

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Michael Shtelma
Hi Keith, Thank you for your answer! I have done this, and it is working for spark driver. I would like to make something like this for the executors as well, so that the setting will be used on all the nodes, where I have executors running. Best, Michael On Mon, Mar 19, 2018 at 6:07 PM, Keith

Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Michael Shtelma
are created under /tmp. I am wondering if there is a way to make spark not use /tmp at all and configure it to create all the files somewhere else ? Any assistance would be greatly appreciated! Best, Michael - To unsubscribe e-mail

Re: [EXT] Debugging a local spark executor in pycharm

2018-03-13 Thread Michael Mansour
, and pass it into the function. This alleviates the need to write debugging code etc. I find this model useful and a bit more fast, but it does not offer the step-through capability. Best of luck! M -- Michael Mansour Data Scientist Symantec CASB From: Vitaliy Pisarev <vitaliy.pisa...@biocatch.

Re: Return statements aren't allowed in Spark closures

2018-02-22 Thread Michael Artz
I am not able to reproduce your error. You should do something before you do that last function and maybe get some more help from the exception it returns. Like just add a csv.show (1) on the line before. Also, can you post the different exception when you took out the "return" value like when

Re: [Structured Streaming] Deserializing avro messages from kafka source using schema registry

2018-02-09 Thread Michael Armbrust
This isn't supported yet, but there is on going work at spark-avro to enable this use case. Stay tuned. On Fri, Feb 9, 2018 at 3:07 PM, Bram wrote: > Hi, > > I couldn't find any documentation about avro message

Re: [Structured Streaming] Commit protocol to move temp files to dest path only when complete, with code

2018-02-09 Thread Michael Armbrust
We didn't go this way initially because it doesn't work on storage systems that have weaker guarantees than HDFS with respect to rename. That said, I'm happy to look at other options if we want to make this configurable. On Fri, Feb 9, 2018 at 2:53 PM, Dave Cameron

Re: Prefer Structured Streaming over Spark Streaming (DStreams)?

2018-01-31 Thread Michael Armbrust
At this point I recommend that new applications are built using structured streaming. The engine was GA-ed as of Spark 2.2 and I know of several very large (trillions of records) production jobs that are running in Structured Streaming. All of our production pipelines at databricks are written

Re: Max number of streams supported ?

2018-01-31 Thread Michael Armbrust
-dev +user > Similarly for structured streaming, Would there be any limit on number of > of streaming sources I can have ? > There is no fundamental limit, but each stream will have a thread on the driver that is doing coordination of execution. We comfortably run 20+ streams on a single

Re: Reading Hive RCFiles?

2018-01-29 Thread Michael Segel
s similarly with PairRDD and RCFileOutputFormat On Thu, Jan 18, 2018 at 5:02 PM, Michael Segel <msegel_had...@hotmail.com<mailto:msegel_had...@hotmail.com>> wrote: No idea on how that last line of garbage got in the message. > On Jan 18, 2018, at 9:32 AM, Michael Segel > &l

spark.sql call takes far too long

2018-01-24 Thread Michael Shtelma
/analysis phases take far too long? Is there a way to figure out, what exactly takes too long? Does anybody have any ideas on this? Any assistance would be greatly appreciated! Thanks, Michael - To unsubscribe e-mail: user-unsubscr

Re: [EXT] How do I extract a value in foreachRDD operation

2018-01-22 Thread Michael Mansour
Toy, I suggest your partition your data according to date, and use the forEachPartition function, using the partition as the bucket location. This would require you to define a custom hash partitioner function, but that is not too difficult. -- Michael Mansour Data Scientist Symantec From: Toy

Re: Reading Hive RCFiles?

2018-01-18 Thread Michael Segel
No idea on how that last line of garbage got in the message. > On Jan 18, 2018, at 9:32 AM, Michael Segel <msegel_had...@hotmail.com> wrote: > > Hi, > > I’m trying to find out if there’s a simple way for Spark to be able to read > an RCFile. > > I kno

Reading Hive RCFiles?

2018-01-18 Thread Michael Segel
Hi, I’m trying to find out if there’s a simple way for Spark to be able to read an RCFile. I know I can create a table in Hive, then drop the files in to that directory and use a sql context to read the file from Hive, however I wanted to read the file directly. Not a lot of details to go

Re: Inner join with the table itself

2018-01-16 Thread Michael Shtelma
Hi Jacek, Thank you for the workaround. It is really working in this way: pos.as("p1").join(pos.as("p2")).filter($"p1.POSITION_ID0"===$"p2.POSITION_ID") I have checked, that in this way I get the same execution plan as for the join with renamed columns. Be
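Spelled out, the workaround looks roughly like this (pos is the DataFrame from the thread; column names follow the snippet):
```scala
import org.apache.spark.sql.functions.col

// Alias the same DataFrame twice so the join condition can qualify the
// columns explicitly instead of being flagged as trivial or ambiguous.
val selfJoined = pos.as("p1")
  .join(pos.as("p2"))
  .filter(col("p1.POSITION_ID0") === col("p2.POSITION_ID"))
```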

Re: Inner join with the table itself

2018-01-15 Thread Michael Shtelma
more really odd thing about all this: a colleague of mine has managed to get the same exception ("Join condition is missing or trivial") also using original SQL query, but I think he has been using empty tables. Thanks, Michael On Mon, Jan 15, 2018 at 11:27 AM, Gengliang Wang <gengli

Re: Dataset API inconsistencies

2018-01-10 Thread Michael Armbrust
I wrote Datasets, and I'll say I only use them when I really need to (i.e. when it would be very cumbersome to express what I am trying to do relationally). Dataset operations are almost always going to be slower than their DataFrame equivalents since they usually require materializing objects

Re: Spark error while trying to spark.read.json()

2017-12-19 Thread Michael Armbrust
- dev java.lang.AbstractMethodError almost always means that you have different libraries on the classpath than at compilation time. In this case I would check to make sure you have the correct version of Scala (and only have one version of scala) on the classpath. On Tue, Dec 19, 2017 at 5:42

Using UDF compiled with Janino in Spark

2017-12-15 Thread Michael Shtelma
rg.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worke

Spark multithreaded job submission from driver

2017-12-14 Thread Michael Artz
Hi, I was wanting to pull data from about 1500 remote Oracle tables with Spark, and I want to have a multi-threaded application that picks up a table per thread or maybe 10 tables per thread and launches a spark job to read from their respective tables. I read official spark site

Apache Spark documentation on mllib's Kmeans doesn't jibe.

2017-12-13 Thread Michael Segel
Hi, Just came across this while looking at the docs on how to use Spark’s Kmeans clustering. Note: This appears to be true in both 2.1 and 2.2 documentation. The overview page: https://spark.apache.org/docs/2.1.0/mllib-clustering.html#k-means Here the example contains the following line:

Re: Kafka version support

2017-11-30 Thread Michael Armbrust
Oh good question. I was saying that the stock structured streaming connector should be able to talk to 0.11 or 1.0 brokers. On Thu, Nov 30, 2017 at 1:12 PM, Cody Koeninger wrote: > Are you talking about the broker version, or the kafka-clients artifact > version? > > On

Re: Kafka version support

2017-11-30 Thread Michael Armbrust
I would expect that to work. On Wed, Nov 29, 2017 at 10:17 PM, Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > Just wondering if anyone has tried spark structured streaming kafka > connector (2.2) with Kafka 0.11 or Kafka 1.0 version > > Thanks > Raghav >

Re: Spark Data Frame. PreSorded partitions

2017-11-28 Thread Michael Artz
I'm not sure other than retrieving from a hive table that is already sorted. This sounds cool though, would be interested to know this as well On Nov 28, 2017 10:40 AM, "Николай Ижиков" wrote: > Hello, guys! > > I work on implementation of custom DataSource for Spark

build spark source code

2017-11-22 Thread Michael Artz
It would be nice if I could download the source code of spark from github, then build it with sbt on my windows machine, and use IntelliJ to make little modifications to the code base. I have installed spark before on windows quite a few times, but I just use the packaged artifact. Has anyone

Caching dataframes and overwrite

2017-11-21 Thread Michael Artz
I have been interested in finding out why I am getting strange behavior when running a certain spark job. The job will error out if I place an action (A .show(1) method) either right after caching the DataFrame or right before writing the dataframe back to hdfs. There is a very similar post to

Re: Generate windows on processing time in Spark Structured Streaming

2017-11-10 Thread Michael Armbrust
Hmmm, we should allow that. current_timestamp() is actually deterministic within any given batch. Could you open a JIRA ticket? On Fri, Nov 10, 2017 at 1:52 AM, wangsan wrote: > Hi all, > > How can I use current processing time to generate windows in streaming > processing? >

Re: [Spark Structured Streaming] Changing partitions of (flat)MapGroupsWithState

2017-11-08 Thread Michael Armbrust
The relevant config is spark.sql.shuffle.partitions. Note that once you start a query, this number is fixed. The config will only affect queries starting from an empty checkpoint. On Wed, Nov 8, 2017 at 7:34 AM, Teemu Heikkilä wrote: > I have spark structured streaming job
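A minimal sketch of setting that config, assuming the query is started from an empty checkpoint:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("stateful-stream")
  // Number of shuffle/state partitions; fixed once a streaming query
  // has written its first checkpoint.
  .config("spark.sql.shuffle.partitions", "64")
  .getOrCreate()
```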

Re: Structured Stream equivalent of reduceByKey

2017-11-06 Thread Michael Armbrust
ode-org.apache.spark.sql.Encoder-org.apache.spark.sql.Encoder-org.apache.spark.sql.streaming.GroupStateTimeout-> probably. On Thu, Oct 26, 2017 at 10:11 PM, Piyush Mukati <piyush.muk...@gmail.com> wrote: > Thanks, Michael > I have explored Aggregator > <https://spark.apache.o

Re: Read parquet files as buckets

2017-11-01 Thread Michael Artz
; > And code for the read : > val df = sparkSession.read.parquet(path).toDF() > > The read code run on other cluster than the write . > > > > > On Tue, Oct 31, 2017 at 7:02 PM Michael Artz <michaelea...@gmail.com> > wrote: > >> What version of sp

Re: Structured Stream equivalent of reduceByKey

2017-10-26 Thread Michael Armbrust
- dev I think you should be able to write an Aggregator . You probably want to run in update mode if you are looking for it to output any group that has changed in the batch. On Wed, Oct 25,

Re: Spark - Partitions

2017-10-17 Thread Michael Artz
Have you tried caching it and using a coalesce? On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" wrote: > I tried repartitions but spark.sql.shuffle.partitions is taking up > precedence over repartitions or coalesce. how to get the lesser number of > files with same

Re: Implement Dataset reader from SEQ file with protobuf to Dataset

2017-10-08 Thread Michael Armbrust
spark-avro would be a good example to start with. On Sun, Oct 8, 2017 at 3:00 AM, Serega Sheypak wrote: > Hi, did anyone try to implement Spark SQL dataset reader from SEQ file > with protobuf inside to Dataset? > > Imagine I

Re: Multiple filters vs multiple conditions

2017-10-03 Thread Michael Artz
Hi Ahmed, Depending on which version you have it could matter. We received an email about multiple conditions in the filter not being picked up. I copied the email below that was sent out to the Spark user list. The user never tried multiple one-condition filters, which might have worked. Hi

Re: Upgraded to spark 2.2 and get Guava error

2017-09-28 Thread Michael C. Kunkel
Greetings, I also noticed that this error does not appear if I spark-submit. But I'd rather know how to solve this error so I can do simple testing in Eclipse. Thanks BR MK Michael C. Kunkel, USMC, PhD Forschungszentrum Jülich Nuclear Physics Institute

Re: Upgraded to spark 2.2 and get Guava error

2017-09-28 Thread Michael C. Kunkel
Greetings, Sorry for the trouble, I had an old guava dependency in the pom file. Not sure how I missed it. BR MK Michael C. Kunkel, USMC, PhD Forschungszentrum Jülich Nuclear Physics Institute and Juelich Center for Hadron Physics Experimental Hadron

Re: Chaining Spark Streaming Jobs

2017-09-18 Thread Michael Armbrust
You specify the schema when loading a dataframe by calling spark.read.schema(...)... On Tue, Sep 12, 2017 at 4:50 PM, Sunita Arvind <sunitarv...@gmail.com> wrote: > Hi Michael, > > I am wondering what I am doing wrong. I get error like: > > Exce
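A short example of supplying the schema up front (path and fields are illustrative):
```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)
))

// Pins the expected layout and skips schema inference.
val df = spark.read.schema(schema).json("/data/events")
```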

Re: [SS] Any way to optimize memory consumption of SS?

2017-09-14 Thread Michael Armbrust
rt() > > query.awaitTermination() > > *and I use play json to parse input logs from kafka ,the parse function is > like* > > def parseFunction(str: String): (Long, String) = { > val json = Json.parse(str) > val timestamp = (json \ "time").get.toString(

Re: [SS]How to add a column with custom system time?

2017-09-14 Thread Michael Armbrust
.groupBy($"date") >>> .count() >>> .withColumn("window", window(current_timestamp(), "15 minutes")) >>> >>> /** >>> * output >>> */ >>> val query = results >>> .writeSt

Re: [Structured Streaming] Multiple sources best practice/recommendation

2017-09-14 Thread Michael Armbrust
I would probably suggest that you partition by format (though you can get the file name from the build in function input_file_name()). You can load multiple streams from different directories and union them together as long as the schema is the same after parsing. Otherwise you can just run
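A hedged sketch of the union approach, with made-up paths and a shared schema:
```scala
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

// Schema both directories conform to after parsing (illustrative).
val eventSchema = new StructType()
  .add("ts", TimestampType)
  .add("msg", StringType)

val jsonStream = spark.readStream.schema(eventSchema).json("/in/json")
val csvStream  = spark.readStream.schema(eventSchema).option("header", "true").csv("/in/csv")

// union() is fine as long as the parsed schemas line up.
val combined = jsonStream.union(csvStream)
```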

Re: Minimum cost flow problem solving in Spark

2017-09-13 Thread Michael Malak
You might be interested in "Maximum Flow implementation on Spark GraphX" done by a Colorado School of Mines grad student a couple of years ago. http://datascienceassn.org/2016-01-27-maximum-flow-implementation-spark-graphx From: Swapnil Shinde

Re: [SS]How to add a column with custom system time?

2017-09-12 Thread Michael Armbrust
Can you show all the code? This works for me. On Tue, Sep 12, 2017 at 12:05 AM, 张万新 <kevinzwx1...@gmail.com> wrote: > The spark version is 2.2.0 > > Michael Armbrust <mich...@databricks.com> wrote on Tue, Sep 12, 2017 at 12:32 PM: > >> Which version of spark? >> >

Re: [SS] Any way to optimize memory consumption of SS?

2017-09-12 Thread Michael Armbrust
Can you show the full query you are running? On Tue, Sep 12, 2017 at 10:11 AM, 张万新 wrote: > Hi, > > I'm using structured streaming to count unique visits of our website. I > use spark on yarn mode with 4 executor instances and from 2 cores * 5g > memory to 4 cores * 10g

Re: [SS]How to add a column with custom system time?

2017-09-11 Thread Michael Armbrust
inistic expressions are only allowed in > > Project, Filter, Aggregate or Window" > > Can you give more advice? > > Michael Armbrust <mich...@databricks.com> wrote on Tue, Sep 12, 2017 at 4:48 AM: > >> import org.apache.spark.sql.functions._ >> >> df.withColumn("w

Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-11 Thread Michael Armbrust
The following will convert the whole row to JSON. import org.apache.spark.sql.functions.* df.select(to_json(struct(col("*" On Sat, Sep 9, 2017 at 6:27 PM, kant kodali wrote: > Thanks Ryan! In this case, I will have Dataset so is there a way to > convert Row to Json
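Completing the truncated snippet, a minimal sketch (df is any DataFrame):
```scala
import org.apache.spark.sql.functions.{col, struct, to_json}

// Pack every column of the row into a struct and serialize it as one
// JSON string, e.g. for use as the value column when writing to Kafka.
val asJson = df.select(to_json(struct(col("*"))).as("value"))
```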

Re: Need some Clarification on checkpointing w.r.t Spark Structured Streaming

2017-09-11 Thread Michael Armbrust
Checkpoints record what has been processed for a specific query, and as such only need to be defined when writing (which is how you "start" a query). You can use the DataFrame created with readStream to start multiple queries, so it wouldn't really make sense to have a single checkpoint there.

Re: [SS]How to add a column with custom system time?

2017-09-11 Thread Michael Armbrust
import org.apache.spark.sql.functions._ df.withColumn("window", window(current_timestamp(), "15 minutes")) On Mon, Sep 11, 2017 at 3:03 AM, 张万新 wrote: > Hi, > > In structured streaming how can I add a column to a dataset with current > system time aligned with 15

Re: Spark SQL vs HiveQL

2017-08-28 Thread Michael Artz
Spark SQL should be your choice > > > On Mon, Aug 28, 2017 at 10:25 PM Michael Artz <michaelea...@gmail.com> > wrote: > >> Just to be clear, I'm referring to having Spark reading from a file, not >> from a Hive table. And it will have tungsten engine off heap seria

Re: Spark SQL vs HiveQL

2017-08-28 Thread Michael Artz
Just to be clear, I'm referring to having Spark reading from a file, not from a Hive table. And it will have tungsten engine off heap serialization after 2.1, so if it was a test with like 1.63 it won't be as helpful. On Mon, Aug 28, 2017 at 10:50 AM, Michael Artz <michaelea...@gmail.com>

Spark SQL vs HiveQL

2017-08-28 Thread Michael Artz
Hi, There isn't any good source to answer the question of whether Hive, as an SQL-on-Hadoop engine, is just as fast as Spark SQL now. I just want to know if there has been a comparison done lately for HiveQL vs Spark SQL on Spark versions 2.1 or later. I have a large ETL process, with many table joins and

add me to email list

2017-08-28 Thread Michael Artz
Hi, Please add me to the email list Mike

Re: Joining 2 dataframes, getting result as nested list/structure in dataframe

2017-08-23 Thread Michael Armbrust
You can create a nested struct that contains multiple columns using struct(). Here's a pretty complete guide on working with nested data: https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html On Wed, Aug 23, 2017 at 2:30 PM, JG Perrin
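A small sketch of struct() producing a nested column after a join (table and column names are made up):
```scala
import org.apache.spark.sql.functions.{col, struct}

// Collapse the joined address columns into one nested struct column.
val nested = people
  .join(addresses, Seq("person_id"))
  .select(
    col("person_id"),
    col("name"),
    struct(col("street"), col("city"), col("zip")).as("address")
  )
```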

Re: Chaining Spark Streaming Jobs

2017-08-23 Thread Michael Armbrust
If you use structured streaming and the file sink, you can have a subsequent stream read using the file source. This will maintain exactly once processing even if there are hiccups or failures. On Mon, Aug 21, 2017 at 2:02 PM, Sunita Arvind wrote: > Hello Spark Experts,
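Roughly, the pattern looks like this (paths, stage1Output, and stage1Schema are placeholders, not from the thread):
```scala
// Stage 1: write results with the file sink; its commit log gives
// exactly-once semantics for the files it produces.
val stage1Query = stage1Output.writeStream
  .format("parquet")
  .option("path", "/pipeline/stage1")
  .option("checkpointLocation", "/pipeline/stage1-checkpoint")
  .start()

// Stage 2: a separate query reads that directory back as a file source.
val stage2Input = spark.readStream
  .schema(stage1Schema) // file sources need an explicit schema
  .parquet("/pipeline/stage1")
```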

Re: Question on how to get appended data from structured streaming

2017-08-20 Thread Michael Armbrust
What is your end goal? Right now the foreach writer is the way to do arbitrary processing on the data produced by various output modes. On Sun, Aug 20, 2017 at 12:23 PM, Yanpeng Lin wrote: > Hello, > > I am new to Spark. > It would be appreciated if anyone could help me
