Re: Can I collect Dataset[Row] to driver without converting it to Array[Row]?

2020-04-22 Thread Michael Artz
What would you do with it once you get it into the driver as a Dataset[Row]? Sent from my iPhone > On Apr 22, 2020, at 3:06 AM, maqy <454618...@qq.com> wrote: > > When the data is stored in the Dataset[Row] format, the memory usage is very > small. > When I use collect() to collect data to

Re: Lightweight pipeline execution for single row

2018-09-23 Thread Michael Artz
Are you using the scheduler in FAIR mode instead of FIFO mode? Sent from my iPhone > On Sep 22, 2018, at 12:58 AM, Jatin Puri wrote: > > Hi. > > What tactics can I apply for such a scenario? > > I have a pipeline of 10 stages. Simple text processing. I train the data with > the pipeline

Re: Pitfalls of partitioning by host?

2018-08-27 Thread Michael Artz
Well, if we think of shuffling as a necessity to perform an operation, then the problem would be that you are adding an aggregation stage to a job that is going to get shuffled anyway. For example, if you need to join two datasets, then Spark will still shuffle the data, whether they are grouped by

Re: Dataframe vs dataset

2018-05-01 Thread Michael Artz
a > woman is a subset of a human. > > > > All DataFrames are Datasets. Not all Datasets are DataFrames. The “subset” > relationship doesn’t apply here. A DataFrame is a specialized type of > Dataset > > > > *From: *Michael Artz <michaelea...@gmail.com> > *

Re: Dataframe vs dataset

2018-04-28 Thread Michael Artz
datasets as typed df and therefore ds are enhanced df > Feel free to disagree.. > Kr > > On Sat, Apr 28, 2018, 2:24 PM Michael Artz <michaelea...@gmail.com> wrote: > >> Hi, >> >> I use Spark every day and I have a good grip on the basics of Spark, so >> this qu

Dataframe vs dataset

2018-04-28 Thread Michael Artz
Hi, I use Spark every day and I have a good grip on the basics of Spark, so this question isn't for myself. But this came up and I wanted to see what other Spark users would say, and I don't want to influence your answer. And SO is weird about polls. The question is "Which one do you feel is

Re: Return statements aren't allowed in Spark closures

2018-02-22 Thread Michael Artz
I am not able to reproduce your error. You should do something before you call that last function and maybe get some more help from the exception it throws. For example, just add a csv.show(1) on the line before. Also, can you post the different exception you got when you took out the "return" value, like when

Spark multithreaded job submission from driver

2017-12-14 Thread Michael Artz
Hi, I want to pull data from about 1500 remote Oracle tables with Spark, and I want to have a multi-threaded application that picks up a table per thread, or maybe 10 tables per thread, and launches a Spark job to read from their respective tables. I read the official Spark site

Re: Spark Data Frame. PreSorted partitions

2017-11-28 Thread Michael Artz
I'm not sure, other than retrieving from a Hive table that is already sorted. This sounds cool though; I would be interested to know this as well. On Nov 28, 2017 10:40 AM, "Николай Ижиков" wrote: > Hello, guys! > > I work on implementation of custom DataSource for Spark

build spark source code

2017-11-22 Thread Michael Artz
It would be nice if I could download the source code of Spark from GitHub, then build it with sbt on my Windows machine, and use IntelliJ to make little modifications to the code base. I have installed Spark on Windows quite a few times before, but I just use the packaged artifact. Has anyone

Caching dataframes and overwrite

2017-11-21 Thread Michael Artz
I have been interested in finding out why I am getting strange behavior when running a certain Spark job. The job will error out if I place an action (a .show(1) call) either right after caching the DataFrame or right before writing the DataFrame back to HDFS. There is a very similar post to

Re: Read parquet files as buckets

2017-11-01 Thread Michael Artz
; > And code for the read : > val df = sparkSession.read.parquet(path).toDF() > > The read code runs on a different cluster than the write. > > > > > On Tue, Oct 31, 2017 at 7:02 PM Michael Artz <michaelea...@gmail.com> > wrote: > >> What version of sp

Re: Spark - Partitions

2017-10-17 Thread Michael Artz
Have you tried caching it and using a coalesce? On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" wrote: > I tried repartitions but spark.sql.shuffle.partitions is taking > precedence over repartitions or coalesce. How to get a smaller number of > files with same

Re: Multiple filters vs multiple conditions

2017-10-03 Thread Michael Artz
Hi Ahmed, Depending on which version you have, it could matter. We received an email about multiple conditions in the filter not being picked up. I copied the email below that was sent out to the Spark user list. The user never tried multiple single-condition filters, which might have worked. Hi

Re: Spark SQL vs HiveQL

2017-08-28 Thread Michael Artz
Spark SQL should be your choice > > > On Mon, Aug 28, 2017 at 10:25 PM Michael Artz <michaelea...@gmail.com> > wrote: > >> Just to be clear, I'm referring to having Spark reading from a file, not >> from a Hive table. And it will have tungsten engine off heap seria

Re: Spark SQL vs HiveQL

2017-08-28 Thread Michael Artz
Just to be clear, I'm referring to having Spark read from a file, not from a Hive table. And it will have Tungsten engine off-heap serialization after 2.1, so if it was a test with something like 1.6.3 it won't be as helpful. On Mon, Aug 28, 2017 at 10:50 AM, Michael Artz <michaelea...@gmail.com>

Spark SQL vs HiveQL

2017-08-28 Thread Michael Artz
Hi, I can't find a good source that answers the question: is Hive, as an SQL-on-Hadoop engine, just as fast as Spark SQL now? I just want to know if there has been a comparison done lately for HiveQL vs Spark SQL on Spark versions 2.1 or later. I have a large ETL process, with many table joins and

add me to email list

2017-08-28 Thread Michael Artz
Hi, Please add me to the email list. Mike