why spark oom off-heap?

2019-09-19 Thread jib...@qq.com
hello, why does Spark usually OOM off-heap during the shuffle read? I have read
some of the source code: when a ResultTask reads shuffle data from a non-local
executor, it has a buffer and spills to disk, so why does it still OOM off-heap?
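For reference, the reducer-side fetch buffers are allocated off-heap by Netty, and a
few settings bound them. A sketch of commonly tuned knobs (the property names are
real Spark configs; the values are illustrative, not recommendations):

spark.reducer.maxSizeInFlight=48m        # cap on in-flight fetched shuffle blocks per reduce task
spark.maxRemoteBlockSizeFetchToMem=200m  # stream blocks larger than this to disk instead of memory
spark.executor.memoryOverhead=2g         # headroom for Netty/off-heap allocations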



jib...@qq.com


Re: Low cache hit ratio when running Spark on Alluxio

2019-09-19 Thread Bin Fan
Depending on the Alluxio version you are running, e.g. for 2.0, the
metrics for local short-circuit reads are not turned on by default.
So I would suggest first turning on collection of local short-circuit
read metrics by setting
alluxio.user.metrics.collection.enabled=true
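For example, this can go in the client-side alluxio-site.properties on the Spark
nodes, or be passed per job as a JVM option (a sketch; it assumes the Alluxio client
picks up its configuration from one of these two places in your deployment):

# client-side conf/alluxio-site.properties
alluxio.user.metrics.collection.enabled=true

# or per Spark job:
$ spark-submit \
  --conf 'spark.executor.extraJavaOptions=-Dalluxio.user.metrics.collection.enabled=true' \
  ...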

Regarding the more general question of achieving high data locality when
running Spark on Alluxio, please read this article:
https://www.alluxio.io/blog/top-10-tips-for-making-the-spark-alluxio-stack-blazing-fast/
and follow the suggestions there. E.g., locality can behave unexpectedly when
running Spark on YARN in this setup.

If you need more detailed instructions, feel free to join Alluxio community
channel https://slackin.alluxio.io 

- Bin Fan
alluxio.io | Data Orchestration Summit 2019


On Wed, Aug 28, 2019 at 1:49 AM Jerry Yan  wrote:

> Hi,
>
> We are running Spark jobs on an Alluxio cluster which is serving 13
> gigabytes of data, with 99% of the data in memory. I was hoping to speed
> up the Spark jobs by reading the in-memory data in Alluxio, but found the
> Alluxio local hit rate is only 1.68%, while the Alluxio remote hit rate is
> 98.32%. By monitoring the network I/O across all worker nodes with the
> "dstat" command, I found that only two nodes transferred about 1 GB in the
> whole process, sending or receiving that 1 GB during the Spark shuffle
> stage. Are there any metrics I could check or configuration to tune?
>
>
> Best,
>
> Jerry
>


Re: Can I set the Alluxio WriteType in Spark applications?

2019-09-19 Thread Bin Fan
Hi Mark,

You can follow the instructions here:
https://docs.alluxio.io/os/user/stable/en/compute/Spark.html#customize-alluxio-user-properties-for-individual-spark-jobs

Something like this:

$ spark-submit \
  --conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH' \
  --conf 'spark.executor.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH' \
  ...
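To make this the default for all jobs rather than per submit, the same options can
go into spark-defaults.conf (a sketch using the same property; other Alluxio
WriteType values such as MUST_CACHE, THROUGH, or ASYNC_THROUGH can be substituted):

# conf/spark-defaults.conf
spark.driver.extraJavaOptions   -Dalluxio.user.file.writetype.default=CACHE_THROUGH
spark.executor.extraJavaOptions -Dalluxio.user.file.writetype.default=CACHE_THROUGH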


Hope it helps

- Bin

On Tue, Sep 17, 2019 at 7:53 AM Mark Zhao  wrote:

> Hi,
>
> If Spark applications write data into Alluxio, can the WriteType be configured?
>
> Thanks,
> Mark
>
>


Parquet read performance for different schemas

2019-09-19 Thread Tomas Bartalos
Hello,

I have two Parquet datasets (each containing 1 file):

   - parquet-wide - schema has 25 top-level columns + 1 array
   - parquet-narrow - schema has 3 top-level columns

Both files have the same data for the shared columns.
When I read from parquet-wide, Spark reports *read 52.6 KB*; from
parquet-narrow, *only 2.6 KB*.
For a bigger dataset the difference is *413 MB vs 961 MB*. Needless to say,
reading the narrow Parquet is much faster.

Since schema pruning is applied, I *expected to get similar results* in
both scenarios (timing and amount of data read).
What do you think is the reason for such a big difference, and is there any
tuning I can do?
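For what it's worth, one way to confirm how much of the schema a scan actually
requests is explain(): in Spark 2.x the FileScan node reports the pruned read
schema. A minimal sketch (the path and column names are placeholders):

val df = spark.read.parquet("/path/to/parquet-wide").select("c1", "c2", "c3")
df.explain()
// the FileScan parquet node should show something like:
//   ReadSchema: struct<c1:int,c2:int,c3:int>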

Thank you,
Tomas


Re: [External]Re: spark 2.x design docs

2019-09-19 Thread yeikel valdes
I am also interested. Many of the docs/books that I've seen focus on
practical usage examples rather than the deep internals of Spark.




 On Wed, 18 Sep 2019 21:12:12 -1100 vipul.s.p...@gmail.com wrote 


Yes,

I realize what you were looking for; I am also looking for the same docs and
haven't found them yet. In the meantime, Jacek Laskowski's GitBooks are the next
best thing to follow, if you haven't already.


Regards


On Thu, Sep 19, 2019 at 12:46 PM  wrote:


Thanks Vipul,

 

I was looking specifically for the documents Spark committers use for reference.

 

Currently I’ve put custom logs into the spark-core sources, then build and run
jobs on it.

From the printed logs I try to understand the execution flows.

 

From: Vipul Rajan 
Sent: Thursday, September 19, 2019 12:23 PM
To: Kamal7 Kumar 
Cc: spark-user 
Subject: [External]Re: spark 2.x design docs

 


https://github.com/JerryLead/SparkInternals/blob/master/EnglishVersion/2-JobLogicalPlan.md
This is pretty old, but it might help a little bit. I myself am going through
the source code and trying to reverse-engineer stuff. Let me know if you'd like
to pool resources sometime.

 

Regards

 

On Thu, Sep 19, 2019 at 11:35 AM  wrote:

Hi,

Can someone provide documents/links (apart from the official documentation) for
understanding the internal workings of spark-core,

documents containing component pseudocode, class diagrams, execution flows,
etc.

Thanks, Kamal


"Confidentiality Warning: This message and any attachments are intended only 
for the use of the intended recipient(s), are confidential and may be 
privileged. If you are not the intended recipient, you are hereby notified that 
any review, re-transmission, conversion to hard copy, copying, circulation or 
other use of this message and any attachments is strictly prohibited. If you 
are not the intended recipient, please notify the sender immediately by return 
email and delete this message and any attachments from your system.

Virus Warning: Although the company has taken reasonable precautions to ensure 
no viruses are present in this email. The company cannot accept responsibility 
for any loss or damage arising from the use of this email or attachment."


"Confidentiality Warning: This message and any attachments are intended only 
for the use of the intended recipient(s), are confidential and may be 
privileged. If you are not the intended recipient, you are hereby notified that 
any review, re-transmission, conversion to hard copy, copying, circulation or 
other use of this message and any attachments is strictly prohibited. If you 
are not the intended recipient, please notify the sender immediately by return 
email and delete this message and any attachments from your system.

Virus Warning: Although the company has taken reasonable precautions to ensure 
no viruses are present in this email. The company cannot accept responsibility 
for any loss or damage arising from the use of this email or attachment."

Incorrect results in left_outer join in DSv2 implementation with filter pushdown - spark 2.3.2

2019-09-19 Thread Shubham Chaurasia
Hi,

Consider the following statements:

1)
> scala> val df = spark.read.format("com.shubham.MyDataSource").load
> scala> df.show
> +---+---+
> |  i|  j|
> +---+---+
> |  0|  0|
> |  1| -1|
> |  2| -2|
> |  3| -3|
> |  4| -4|
> +---+---+
2)
> scala> val df1 = df.filter("i < 3")
> scala> df1.show
> +---+---+
> |  i|  j|
> +---+---+
> |  0|  0|
> |  1| -1|
> |  2| -2|
> +---+---+
3)
> scala> df.join(df1, Seq("i"), "left_outer").show
> +---+---+---+
> |  i|  j|  j|
> +---+---+---+
> |  1| -1| -1|
> |  2| -2| -2|
> |  0|  0|  0|
> +---+---+---+


3) does not produce the right results for the left_outer join: all five left-side rows should be preserved, but the rows for i=3 and i=4 are missing.
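For reference, a correct left_outer result would keep all five left-side rows,
with nulls for the non-matching right side (row order may vary):

+---+---+----+
|  i|  j|   j|
+---+---+----+
|  0|  0|   0|
|  1| -1|  -1|
|  2| -2|  -2|
|  3| -3|null|
|  4| -4|null|
+---+---+----+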

Here is the minimal code.

---

import java.util.Arrays;
import java.util.List;
import java.util.Map;

import com.google.common.collect.Lists;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.sources.Filter;
import org.apache.spark.sql.sources.LessThan;
import org.apache.spark.sql.sources.v2.reader.DataReaderFactory;
import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
import org.apache.spark.sql.sources.v2.reader.SupportsPushDownFilters;
import org.apache.spark.sql.types.StructType;

public class MyDataSourceReader implements DataSourceReader,
    SupportsPushDownFilters {

  private Filter[] pushedFilters = new Filter[0];
  private boolean hasFilters = false;

  public MyDataSourceReader(Map<String, String> options) {
    System.out.println("MyDataSourceReader.MyDataSourceReader: Instantiated " + this);
  }

  @Override
  public StructType readSchema() {
    return (new StructType())
        .add("i", "int")
        .add("j", "int");
  }

  @Override
  public Filter[] pushFilters(Filter[] filters) {
    System.out.println("MyDataSourceReader.pushFilters: " + Arrays.toString(filters));
    hasFilters = true;
    pushedFilters = filters;
    // Residual filters Spark must still evaluate itself; returning an empty
    // array claims every pushed filter is fully handled by this source.
    return new Filter[0];
  }

  @Override
  public Filter[] pushedFilters() {
    return pushedFilters;
  }

  @Override
  public List<DataReaderFactory<Row>> createDataReaderFactories() {
    System.out.println("===MyDataSourceReader.createDataReaderFactories===");
    int ltFilter = Integer.MAX_VALUE;
    if (hasFilters) {
      ltFilter = getLTFilter("i");
    }
    hasFilters = false;
    return Lists.newArrayList(new SimpleDataReaderFactory(0, 5, ltFilter));
  }

  private int getLTFilter(String attributeName) {
    int filterValue = Integer.MAX_VALUE;
    for (Filter pushedFilter : pushedFilters) {
      if (pushedFilter instanceof LessThan) {
        LessThan lt = (LessThan) pushedFilter;
        if (lt.attribute().equals(attributeName)) {
          filterValue = (int) lt.value();
        }
      }
    }
    return filterValue;
  }

}



import org.apache.spark.sql.Row;
import org.apache.spark.sql.catalyst.expressions.GenericRow;
import org.apache.spark.sql.sources.v2.reader.DataReader;
import org.apache.spark.sql.sources.v2.reader.DataReaderFactory;

public class SimpleDataReaderFactory implements DataReaderFactory<Row> {

  private final int start;
  private final int end;
  private final int iLTFilter;

  public SimpleDataReaderFactory(int start, int end, int iLTFilter) {
    this.start = start;
    this.end = end;
    this.iLTFilter = iLTFilter;
  }

  @Override
  public DataReader<Row> createDataReader() {
    return new SimpleDataReader(start, end, iLTFilter);
  }

  public static class SimpleDataReader implements DataReader<Row> {
    private final int start;
    private final int end;
    private int current;
    private final int iLTFilter;

    public SimpleDataReader(int start, int end, int iLTFilter) {
      this.start = start;
      this.end = end;
      this.current = start - 1;
      this.iLTFilter = iLTFilter;
    }

    @Override
    public boolean next() {
      current++;
      // Stop at the end of the range or once the pushed (i < value) bound is reached.
      return current < end && current < iLTFilter;
    }

    @Override
    public Row get() {
      return new GenericRow(new Object[]{current, -current});
    }

    @Override
    public void close() {
    }
  }
}



It seems that somehow Spark is applying the filter (i < 3) after the
left_outer join as well, which is why we see the incorrect results in 3).
However, I don't see any Filter node after the join in the plan.

== Physical Plan ==
> *(5) Project [i#136, j#137, j#228]
> +- SortMergeJoin [i#136], [i#227], LeftOuter
>:- *(2) Sort [i#136 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(i#136, 200)
>: +- *(1) DataSourceV2Scan [i#136, j#137],
> com.shubham.reader.MyDataSourceReader@714bd7ad
>+- *(4) Sort [i#227 ASC NULLS FIRST], false, 0
>   +- ReusedExchange [i#227, j#228], Exchange hashpartitioning(i#136,
> 200)


Any ideas what might be going wrong?

Thanks,
Shubham


[no subject]

2019-09-19 Thread Georg Heiler
Hi,

How can I create an initial state by hand so that the structured streaming
file source only reads data that is semantically (i.e., lexicographically by
file path) greater than the minimum committed initial state?

Details here:
https://stackoverflow.com/questions/58004832/spark-structured-streaming-file-source-read-from-a-certain-partition-onwards
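A path glob can restrict which partitions the file source sees, although that
only enumerates paths up front rather than seeding a committed state, hence the
question. A sketch, where the schema and the date-partitioned layout are
assumptions:

val stream = spark.readStream
  .schema(eventSchema)  // file sources require an explicit schema; eventSchema is a placeholder
  .parquet("/data/events/date=2019-09-{19,20}/*")  // hypothetical layout; only these partitions are read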

Best,
Georg


Re: [External]Re: spark 2.x design docs

2019-09-19 Thread Vipul Rajan
Yes,

I realize what you were looking for; I am also looking for the same docs and
haven't found them yet. In the meantime, Jacek Laskowski's GitBooks are the
next best thing to follow, if you haven't already.

Regards

On Thu, Sep 19, 2019 at 12:46 PM  wrote:

> Thanks Vipul,
>
>
>
> I was looking specifically for the documents Spark committers use for reference.
>
>
>
> Currently I’ve put custom logs into the spark-core sources, then build and
> run jobs on it.
>
> From the printed logs I try to understand the execution flows.
>
>
>
> *From:* Vipul Rajan 
> *Sent:* Thursday, September 19, 2019 12:23 PM
> *To:* Kamal7 Kumar 
> *Cc:* spark-user 
> *Subject:* [External]Re: spark 2.x design docs
>
>
>
>
>
> https://github.com/JerryLead/SparkInternals/blob/master/EnglishVersion/2-JobLogicalPlan.md
> This is pretty old, but it might help a little bit. I myself am going
> through the source code and trying to reverse-engineer stuff. Let me know
> if you'd like to pool resources sometime.
>
>
>
> Regards
>
>
>
> On Thu, Sep 19, 2019 at 11:35 AM  wrote:
>
> Hi,
>
> Can someone provide documents/links (apart from the official documentation) *for
> understanding the internal workings of spark-core*,
>
> documents containing component pseudocode, class diagrams, execution
> flows, etc.
>
> Thanks, Kamal
>
>
> "*Confidentiality Warning*: This message and any attachments are intended
> only for the use of the intended recipient(s), are confidential and may be
> privileged. If you are not the intended recipient, you are hereby notified
> that any review, re-transmission, conversion to hard copy, copying,
> circulation or other use of this message and any attachments is strictly
> prohibited. If you are not the intended recipient, please notify the sender
> immediately by return email and delete this message and any attachments
> from your system.
>
> *Virus Warning:* Although the company has taken reasonable precautions to
> ensure no viruses are present in this email. The company cannot accept
> responsibility for any loss or damage arising from the use of this email or
> attachment."
>
>
> "*Confidentiality Warning*: This message and any attachments are intended
> only for the use of the intended recipient(s), are confidential and may be
> privileged. If you are not the intended recipient, you are hereby notified
> that any review, re-transmission, conversion to hard copy, copying,
> circulation or other use of this message and any attachments is strictly
> prohibited. If you are not the intended recipient, please notify the sender
> immediately by return email and delete this message and any attachments
> from your system.
>
> *Virus Warning:* Although the company has taken reasonable precautions to
> ensure no viruses are present in this email. The company cannot accept
> responsibility for any loss or damage arising from the use of this email or
> attachment."
>


unsubscribe

2019-09-19 Thread Mario Amatucci



RE: [External]Re: spark 2.x design docs

2019-09-19 Thread Kamal7.Kumar
Thanks Vipul,

I was looking specifically for the documents Spark committers use for reference.

Currently I’ve put custom logs into the spark-core sources, then build and run
jobs on it.
From the printed logs I try to understand the execution flows.
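As an aside, much of the execution flow can be surfaced without rebuilding by
raising log levels in Spark's conf/log4j.properties; a sketch (the choice of
loggers is illustrative):

# conf/log4j.properties
log4j.logger.org.apache.spark.scheduler=DEBUG
log4j.logger.org.apache.spark.storage=DEBUG
log4j.logger.org.apache.spark.shuffle=DEBUG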

From: Vipul Rajan 
Sent: Thursday, September 19, 2019 12:23 PM
To: Kamal7 Kumar 
Cc: spark-user 
Subject: [External]Re: spark 2.x design docs


https://github.com/JerryLead/SparkInternals/blob/master/EnglishVersion/2-JobLogicalPlan.md
This is pretty old, but it might help a little bit. I myself am going through
the source code and trying to reverse-engineer stuff. Let me know if you'd like
to pool resources sometime.

Regards

On Thu, Sep 19, 2019 at 11:35 AM <kamal7.ku...@ril.com> wrote:
Hi,
Can someone provide documents/links (apart from the official documentation) for
understanding the internal workings of spark-core,
documents containing component pseudocode, class diagrams, execution flows,
etc.
Thanks, Kamal

"Confidentiality Warning: This message and any attachments are intended only 
for the use of the intended recipient(s), are confidential and may be 
privileged. If you are not the intended recipient, you are hereby notified that 
any review, re-transmission, conversion to hard copy, copying, circulation or 
other use of this message and any attachments is strictly prohibited. If you 
are not the intended recipient, please notify the sender immediately by return 
email and delete this message and any attachments from your system.

Virus Warning: Although the company has taken reasonable precautions to ensure 
no viruses are present in this email. The company cannot accept responsibility 
for any loss or damage arising from the use of this email or attachment."
"Confidentiality Warning: This message and any attachments are intended only 
for the use of the intended recipient(s). 
are confidential and may be privileged. If you are not the intended recipient. 
you are hereby notified that any 
review. re-transmission. conversion to hard copy. copying. circulation or other 
use of this message and any attachments is 
strictly prohibited. If you are not the intended recipient. please notify the 
sender immediately by return email. 
and delete this message and any attachments from your system.

Virus Warning: Although the company has taken reasonable precautions to ensure 
no viruses are present in this email. 
The company cannot accept responsibility for any loss or damage arising from 
the use of this email or attachment."


Re: spark 2.x design docs

2019-09-19 Thread Vipul Rajan
https://github.com/JerryLead/SparkInternals/blob/master/EnglishVersion/2-JobLogicalPlan.md
This is pretty old, but it might help a little bit. I myself am going
through the source code and trying to reverse-engineer stuff. Let me know
if you'd like to pool resources sometime.

Regards

On Thu, Sep 19, 2019 at 11:35 AM  wrote:

> Hi,
>
> Can someone provide documents/links (apart from the official documentation) *for
> understanding the internal workings of spark-core*,
>
> documents containing component pseudocode, class diagrams, execution
> flows, etc.
>
> Thanks, Kamal
>
>
> "*Confidentiality Warning*: This message and any attachments are intended
> only for the use of the intended recipient(s), are confidential and may be
> privileged. If you are not the intended recipient, you are hereby notified
> that any review, re-transmission, conversion to hard copy, copying,
> circulation or other use of this message and any attachments is strictly
> prohibited. If you are not the intended recipient, please notify the sender
> immediately by return email and delete this message and any attachments
> from your system.
>
> *Virus Warning:* Although the company has taken reasonable precautions to
> ensure no viruses are present in this email. The company cannot accept
> responsibility for any loss or damage arising from the use of this email or
> attachment."
>


spark 2.x design docs

2019-09-19 Thread Kamal7.Kumar
Hi,
Can someone provide documents/links (apart from the official documentation) for
understanding the internal workings of spark-core,
documents containing component pseudocode, class diagrams, execution flows,
etc.
Thanks, Kamal
"Confidentiality Warning: This message and any attachments are intended only 
for the use of the intended recipient(s). 
are confidential and may be privileged. If you are not the intended recipient. 
you are hereby notified that any 
review. re-transmission. conversion to hard copy. copying. circulation or other 
use of this message and any attachments is 
strictly prohibited. If you are not the intended recipient. please notify the 
sender immediately by return email. 
and delete this message and any attachments from your system.

Virus Warning: Although the company has taken reasonable precautions to ensure 
no viruses are present in this email. 
The company cannot accept responsibility for any loss or damage arising from 
the use of this email or attachment."