Why does Spark OOM off-heap?

2019-09-19 Thread jib...@qq.com
Hello, why does Spark usually OOM off-heap during the shuffle read? I read some source code: when a ResultTask reads shuffle data from a non-local executor, it has a buffer and spills to disk, so why is there still an off-heap OOM? jib...@qq.com
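Remote shuffle fetches go through Netty, which buffers blocks in off-heap (direct) memory before any spill happens, so the reader can exceed the container's overhead allowance even though it spills to disk later. A sketch of configs commonly tuned for this (values are illustrative, not from the thread):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    // Bound the shuffle data buffered in flight per reducer (default 48m).
    .set("spark.reducer.maxSizeInFlight", "24m")
    // Stream very large remote blocks to disk instead of holding them
    // fully in direct memory (available in newer Spark releases).
    .set("spark.maxRemoteBlockSizeFetchToMem", "200m")
    // Extra head-room for off-heap allocations on YARN/Kubernetes.
    .set("spark.executor.memoryOverhead", "2g")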

Re: Low cache hit ratio when running Spark on Alluxio

2019-09-19 Thread Bin Fan
Depending on the Alluxio version you are running, e.g., for 2.0, the metrics for local short-circuit reads are not turned on by default. So I would suggest first turning on the collection of local short-circuit read metrics by setting alluxio.user.metrics.collection.enabled=true. Regarding the
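One way to pass that client-side property to a Spark job is as a JVM system property, mirroring the spark-submit pattern used elsewhere in this digest (a sketch; only the property name comes from the reply above):

  $ spark-submit \
    --conf 'spark.driver.extraJavaOptions=-Dalluxio.user.metrics.collection.enabled=true' \
    --conf 'spark.executor.extraJavaOptions=-Dalluxio.user.metrics.collection.enabled=true' \
    ...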

Re: Can I set the Alluxio WriteType in Spark applications?

2019-09-19 Thread Bin Fan
Hi Mark, You can follow the instructions here: https://docs.alluxio.io/os/user/stable/en/compute/Spark.html#customize-alluxio-user-properties-for-individual-spark-jobs Something like this: $ spark-submit \ --conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH'
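The command above is cut off; the linked docs set the property on both the driver and the executors, so the full invocation presumably looks like this (a sketch, with CACHE_THROUGH taken from the message):

  $ spark-submit \
    --conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH' \
    --conf 'spark.executor.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH' \
    ...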

Parquet read performance for different schemas

2019-09-19 Thread Tomas Bartalos
Hello, I have 2 Parquet datasets (each containing 1 file):
- parquet-wide: the schema has 25 top-level cols + 1 array
- parquet-narrow: the schema has 3 top-level cols
Both files have the same data for the given columns. When I read from parquet-wide, Spark reports *read 52.6 KB*; from parquet-narrow *only 2.6
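Since Parquet is columnar, explicitly projecting only the needed columns normally lets Spark prune the unread ones, so the wide file's scan size should drop toward the narrow one's. A minimal sketch for comparing the two reads (paths and column names are hypothetical):

  scala> val wide   = spark.read.parquet("/data/parquet-wide")
  scala> val narrow = spark.read.parquet("/data/parquet-narrow")
  scala> wide.select("a", "b", "c").count()    // check "input size" in the Spark UI
  scala> narrow.select("a", "b", "c").count()  // compare against the wide read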

Re: [External]Re: spark 2.x design docs

2019-09-19 Thread yeikel valdes
I am also interested. Many of the docs/books I've seen cover practical usage and examples rather than the deep internals of Spark. On Wed, 18 Sep 2019 21:12:12 -1100 vipul.s.p...@gmail.com wrote: > Yes, I realize what you were looking for, I am also looking for the same docs.

Incorrect results in left_outer join in DSv2 implementation with filter pushdown - spark 2.3.2

2019-09-19 Thread Shubham Chaurasia
Hi, Consider the following statements:

1)
scala> val df = spark.read.format("com.shubham.MyDataSource").load
scala> df.show
+---+---+
|  i|  j|
+---+---+
|  0|  0|
|  1| -1|
|  2| -2|
|  3| -3|
|  4| -4|
+---+---+

2)
scala> val df1 = df.filter("i < 3")
scala> df1.show
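The message is truncated, but given the subject, a plausible continuation of the setup is a left_outer join between the unfiltered and filtered frames (a sketch; the join below is an assumption, not quoted from the report):

  scala> val joined = df.join(df1, Seq("i"), "left_outer")
  scala> joined.show
  // Correct semantics: every row of df survives the left side of the join.
  // If the DSv2 reader instance is shared between the two frames, the
  // pushed-down filter "i < 3" can leak into df's scan as well, dropping
  // rows with i >= 3, which is the kind of incorrect result the subject
  // line describes.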

[no subject]

2019-09-19 Thread Georg Heiler
Hi, How can I create an initial state by hand so that the structured streaming file source only reads data whose file path is semantically (i.e., lexicographically) greater than the minimum committed initial state? Details here:
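One workaround sketch (not necessarily what the thread settled on): keep the hand-picked minimum path as a constant and filter rows by their source file, since org.apache.spark.sql.functions.input_file_name exposes the path. Note the source still lists and opens all files; only rows from older paths are dropped. The paths and schema below are hypothetical:

  scala> import org.apache.spark.sql.functions.input_file_name
  scala> val minCommitted = "s3://bucket/data/part-2019-09-18"  // hand-chosen initial state
  scala> val stream = spark.readStream
       |   .schema(inputSchema)              // hypothetical schema
       |   .parquet("s3://bucket/data/")
       |   .where(input_file_name() > minCommitted)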

Re: [External]Re: spark 2.x design docs

2019-09-19 Thread Vipul Rajan
Yes, I realize what you were looking for; I am also looking for the same docs and haven't found them yet. Also, Jacek Laskowski's GitBooks are the next best thing to follow, if you haven't yet. Regards On Thu, Sep 19, 2019 at 12:46 PM wrote: > Thanks Vipul, > I was looking specifically for

unsubscribe

2019-09-19 Thread Mario Amatucci

RE: [External]Re: spark 2.x design docs

2019-09-19 Thread Kamal7.Kumar
Thanks Vipul, I was looking specifically for the documents Spark committers use for reference. Currently I've put custom logs into the spark-core sources, then build and run jobs on it. From the printed logs I try to understand the execution flows. From: Vipul Rajan Sent: Thursday, September 19, 2019
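For reference, Spark's own classes mix in the org.apache.spark.internal.Logging trait, so the kind of trace line described above typically looks like this inside a spark-core class (a sketch; the class and message are hypothetical):

  import org.apache.spark.internal.Logging

  class ShuffleFetchTrace extends Logging {
    def onFetch(blockId: String): Unit =
      logInfo(s"custom-trace: fetching shuffle block $blockId")
  }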

Re: spark 2.x design docs

2019-09-19 Thread Vipul Rajan
https://github.com/JerryLead/SparkInternals/blob/master/EnglishVersion/2-JobLogicalPlan.md This is pretty old, but it might help a little bit. I myself am going through the source code and trying to reverse engineer stuff. Let me know if you'd like to pool resources sometime. Regards On Thu, Sep

spark 2.x design docs

2019-09-19 Thread Kamal7.Kumar
Hi, Can someone provide documents/links (apart from the official documentation) for understanding the internal workings of spark-core: documents containing component pseudocode, class diagrams, execution flows, etc.? Thanks, Kamal