how to limit tasks num when read hive with orc

2019-11-11 Thread lk_spark
hi, all: I have a Hive table STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'. Many of its files are very small, so when I use Spark to read it, thousands of tasks start. How can I limit the number of tasks? 2019-11-12 lk_spark
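A minimal sketch of one way to do this in PySpark, assuming the table can be read through Spark's native ORC reader: raise the per-partition byte targets so many small files are packed into one task, or coalesce after reading. Table name and sizes below are placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("read-small-orc-files")
             # let Spark's native ORC reader handle the Hive table so the
             # file-packing settings below take effect
             .config("spark.sql.hive.convertMetastoreOrc", "true")
             # pack many small files into each input partition (sizes are illustrative)
             .config("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)
             .config("spark.sql.files.openCostInBytes", 8 * 1024 * 1024)
             .enableHiveSupport()
             .getOrCreate())

    df = spark.table("my_db.my_orc_table")   # placeholder table name
    print(df.rdd.getNumPartitions())         # should be far lower than the file count

    # alternatively, collapse partitions after reading, without a shuffle
    df = df.coalesce(200)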

Is RDD thread safe?

2019-11-11 Thread Chang Chen
Hi all, I have a case where I need to cache a source RDD and then create different DataFrames from it in different threads to accelerate queries. I know that SparkSession is thread safe (https://issues.apache.org/jira/browse/SPARK-15135), but I am not sure whether RDD is thread safe or not. Thanks
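For reference, a minimal sketch of the pattern described: RDDs are immutable, so reading a cached RDD from several threads is safe, and Spark accepts concurrent job submissions from multiple threads. The data and queries below are made up.

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("concurrent-queries").getOrCreate()

    # made-up source data: (id, bucket) pairs, cached once up front
    source_rdd = spark.sparkContext.parallelize([(i, i % 10) for i in range(100000)])
    source_rdd.cache()
    source_rdd.count()   # materialize the cache before the threads start

    def run_query(threshold):
        # each thread builds its own DataFrame from the shared, immutable RDD
        df = spark.createDataFrame(source_rdd, ["id", "bucket"])
        return df.where(df.bucket >= threshold).count()

    with ThreadPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(run_query, [2, 4, 6, 8])))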

Re: What is directory "/path/_spark_metadata" for?

2019-11-11 Thread Bin Fan
Hey Mark, I believe this is the name of the subdirectory that is used to store metadata about which files are valid; see the comment in the code: https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L33 Do you see the exception
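For anyone hitting this fresh: any Structured Streaming query that writes with the file sink keeps that commit log under <output path>/_spark_metadata. A small illustration, with placeholder paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("file-sink-metadata").getOrCreate()

    stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    query = (stream.writeStream
             .format("parquet")
             .option("path", "/tmp/stream_out")              # /tmp/stream_out/_spark_metadata/ appears here
             .option("checkpointLocation", "/tmp/stream_chk")
             .start())

    query.awaitTermination(30)
    query.stop()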

Re: Using Percentile in Spark SQL

2019-11-11 Thread Jerry Vinokurov
I don't think the Spark configuration is what you want to focus on. It's hard to say without knowing the specifics of the job or the data volume, but you should be able to accomplish this with the percent_rank function in SparkSQL and a smart partitioning of the data. If your data has a lot of
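A small sketch of that suggestion, with a made-up schema and partitioning key:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()

    # made-up schema: one measurement per (group, value)
    df = spark.createDataFrame(
        [("a", 10.0), ("a", 20.0), ("a", 30.0), ("b", 5.0), ("b", 15.0)],
        ["group", "value"])

    # percent_rank is a window function, so it needs a window spec;
    # partitioning on a selective key keeps each window small
    w = Window.partitionBy("group").orderBy("value")
    df.withColumn("pct", F.percent_rank().over(w)).show()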

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-11 Thread Vadim Semenov
There's an umbrella ticket for various 2GB limitations https://issues.apache.org/jira/browse/SPARK-6235 On Fri, Nov 8, 2019 at 4:11 PM Jacob Lynn wrote: > > Sorry for the noise, folks! I understand that reducing the number of > partitions works around the issue (at the scale I'm working at,
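For reference, a hedged sketch of the workaround mentioned in the thread: fewer shuffle partitions (on both the map and reduce side) mean fewer map statuses for the driver to serialize. Values and table name are illustrative only.

    from pyspark.sql import SparkSession

    # driver memory usually has to be set at submit time, e.g.
    #   spark-submit --driver-memory 16g ...
    spark = (SparkSession.builder
             # fewer reduce partitions -> fewer per-map status entries to serialize
             .config("spark.sql.shuffle.partitions", "5000")   # illustrative value
             .enableHiveSupport()
             .getOrCreate())

    # coarsening the map side as well (here via coalesce) shrinks the
    # map-status table tracked by MapOutputTracker further
    df = spark.table("big_table").coalesce(20000)              # placeholder table name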

Re: Using Percentile in Spark SQL

2019-11-11 Thread Tzahi File
Currently, I'm using the percentile approx function with Hive. I'm looking for a better way to run this function, or another way to get the same result with Spark, but faster and without using gigantic instances. I'm trying to optimize this job by changing the Spark configuration. If you have any

Re: Why Spark generates Java code and not Scala?

2019-11-11 Thread Marcin Tustin
Well TIL. For those also newly informed: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-whole-stage-codegen.html https://mail-archives.apache.org/mod_mbox/spark-dev/201911.mbox/browser On Sun, Nov 10, 2019 at 7:57 AM Holden Karau wrote: > *This Message originated outside

Re: Using Percentile in Spark SQL

2019-11-11 Thread Muthu Jayakumar
If you would require higher precision, you may have to write a custom udaf. In my case, I ended up storing the data as a key-value ordered list of histograms. Thanks Muthu On Mon, Nov 11, 2019, 20:46 Patrick McCarthy wrote: > Depending on your tolerance for error you could also use >
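One way to sketch such a custom aggregation in PySpark (2.4+) is a grouped-aggregate pandas UDF; here an exact percentile via NumPy on made-up data, as an illustration rather than Muthu's actual histogram-based UDAF:

    import numpy as np
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()

    # made-up data: one latency sample per row
    df = spark.createDataFrame(
        [("us", 120.0), ("us", 340.0), ("us", 95.0), ("de", 80.0), ("de", 60.0)],
        ["country", "latency_ms"])

    @pandas_udf("double", PandasUDFType.GROUPED_AGG)
    def p95_exact(v):
        # exact percentile over the whole group: precise, but the group
        # must fit in memory on one executor
        return float(np.percentile(v, 95))

    df.groupBy("country").agg(p95_exact("latency_ms").alias("p95")).show()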

Re: Using Percentile in Spark SQL

2019-11-11 Thread Patrick McCarthy
Depending on your tolerance for error you could also use percentile_approx(). On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov wrote: > Do you mean that you are trying to compute the percent rank of some data? > You can use the SparkSQL percent_rank function for that, but I don't think > that's
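For completeness, the built-in function can be called straight from Spark SQL; the table and column names below are assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # assumes a table named "events" with a numeric column "latency_ms";
    # the optional third argument trades accuracy for memory
    spark.sql("""
        SELECT country,
               percentile_approx(latency_ms, 0.95, 10000) AS p95_latency
        FROM events
        GROUP BY country
    """).show()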

Re: Using Percentile in Spark SQL

2019-11-11 Thread Jerry Vinokurov
Do you mean that you are trying to compute the percent rank of some data? You can use the SparkSQL percent_rank function for that, but I don't think that's going to give you any improvement over calling the percentRank function on the data frame. Are you currently using a user-defined function for

Using Percentile in Spark SQL

2019-11-11 Thread Tzahi File
Hi, currently I'm using a huge Hive cluster (m5.24xl * 40 workers) to run a percentile function. I'm trying to improve this job by moving it to Spark SQL. Any suggestions on how to use a percentile function in Spark? Thanks, -- Tzahi File, Data Engineer, ironSource

Re: PySpark Pandas UDF

2019-11-11 Thread gal.benshlomo
Hi, thanks for your reply. I tried what you suggested and I'm still getting the same error. Also worth mentioning that when I tried to simply write the DataFrame to S3 without applying the function, it works. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
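Since the thread doesn't show the failing code, here is a minimal grouped-map pandas UDF plus a write, as a baseline to compare against; the schema, transformation, and bucket path are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()

    # placeholder data and schema
    df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v"])

    @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
    def normalize(pdf):
        # toy transformation applied per group
        return pdf.assign(v=pdf.v - pdf.v.mean())

    (df.groupby("id").apply(normalize)
       .write.mode("overwrite")
       .parquet("s3a://my-bucket/out/"))    # placeholder bucket/path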