Spark requires sysctl tuning? Servers unresponsive

2016-08-05 Thread Ruslan Dautkhanov
Hello,

When I start a Spark notebook, it exhausts some Linux kernel resource on
some of the servers, to the point where I can't even ssh to those nodes.

It's not because the servers are being hammered: it happens even when no
Spark jobs/tasks are running. To reproduce the problem, it's enough to just
start a Spark notebook (and keep the Spark context up).

Spark version: 1.5, but I had this problem in previous versions of Spark too.
Hadoop 2.6. Spark on YARN.
There are 150 containers, each with 2 vcores.

There is plenty of memory (YARN reserved 1.3 TB of memory across 10 nodes
for those 150 containers). GC is minuscule. There are also many cores in the
system (24-48 cores per node). Servers are RHEL 6.7,
on 2.6.32-573.22.1.el6.x86_64.

We have monitoring running on the servers.
Number of open connections jumps up to 2500-4000 when Spark is up.
It's just around 200 when Spark is not running.
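To narrow down which processes hold those sockets, a quick diagnostic along
these lines can be run on an affected node (just a sketch; it uses the psutil
package, which is not part of Spark):

from collections import Counter
import psutil  # pip install psutil

# Count open inet sockets per PID.
counts = Counter()
for conn in psutil.net_connections(kind='inet'):
    if conn.pid is not None:
        counts[conn.pid] += 1

# Print the top offenders with their process names.
for pid, n in counts.most_common(15):
    try:
        name = psutil.Process(pid).name()
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        name = '?'
    print('%6d sockets  pid=%-6d %s' % (n, pid, name))

If the YARN container JVMs dominate the list, the connections are coming from
Spark's block transfer / shuffle layer rather than from something else on the
box.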

I have tried bumping up many of the /etc/sysctl.conf limits that we might
be hitting. The current set of non-default sysctl settings is in *[1]*.
With those parameters increased to these levels it feels better, but the
problem still happens.

After some time of running tasks, jobs start to fail. It looks like there
might be some leaking Netty connections or something like that?
Jobs fail with a "Failed to connect to ..." error; see *[2]*. But this happens
much later: it is the apogee of the problem, when it starts affecting jobs.
Before that, Spark itself exhausts some resource, and we can see that because
we can't even ssh to some of the servers.

We also tuned up ulimits on the system, but there was no change in behavior.
The problem is always reproducible.
See the limits.d conf file in *[3]*.

We only have this problem when Spark is running. Hive etc. run just fine.

Any ideas?

Thank you.



*[1]*
Current set of non-default sysctl settings:

net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
vm.swappiness = 4
net.ipv4.ip_local_port_range = 2000 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_syn_backlog = 4096
net.core.somaxconn = 2048
net.ipv4.tcp_synack_retries = 3
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_max_syn_backlog = 8096
net.core.netdev_max_backlog = 16000
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.core.rmem_default = 262144
net.core.wmem_default = 262144



*[2]* Error stack:

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 101 in stage 54.0 failed 4 times, most recent failure: Lost task 101.3 in stage 54.0 (TID 18504, abc.def.com): java.io.IOException: Failed to connect to abc.def.com/10.20.32.118:23078
  at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
  at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
  at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
  at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
  at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
  at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: abc.def.com/10.20.32.118:23078
  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
  at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
  at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
  at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
  at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
  at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
  at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)

Re: PySpark: Make persist() return a context manager

2016-08-05 Thread Koert Kuipers
I think it limits the usability of the with statement, and it could be
somewhat confusing because of this, so I would mention it in the docs.

I like the idea though.
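To make that caveat concrete, here is a rough sketch of what such a context
manager could look like and where it goes wrong. This is a hypothetical
wrapper for illustration only, not the proposed change to the Spark API:

class persisted(object):
    """Hypothetical helper: persist a DataFrame for the duration of a with block."""

    def __init__(self, df):
        self._df = df.persist()

    def __enter__(self):
        return self._df

    def __exit__(self, exc_type, exc_value, traceback):
        self._df.unpersist()  # always unpersist, even if the block raised
        return False          # do not swallow exceptions


def train(pipeline, labeled_data):
    # Correct: the action (fit) runs while the data is still persisted.
    with persisted(labeled_data) as cached:
        return pipeline.fit(cached)


def train_wrong(pipeline, labeled_data):
    # Pitfall: only a lazy transformation runs inside the block; the action
    # below fires after unpersist(), so the caching buys nothing.
    with persisted(labeled_data) as cached:
        selected = cached.select('features', 'label')
    return pipeline.fit(selected)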

On Fri, Aug 5, 2016 at 7:04 PM, Nicholas Chammas  wrote:

> Good point.
>
> Do you think it's sufficient to note this somewhere in the documentation
> (or simply assume that user understanding of transformations vs. actions
> means they know this), or are there other implications that need to be
> considered?
>
> On Fri, Aug 5, 2016 at 6:50 PM Koert Kuipers  wrote:
>
>> The tricky part is that the action needs to be inside the with block, not
>> just the transformation that uses the persisted data.
>>
>> On Aug 5, 2016 1:44 PM, "Nicholas Chammas" 
>> wrote:
>>
>> Okie doke, I've filed a JIRA for this here:
>> https://issues.apache.org/jira/browse/SPARK-16921
>>
>> On Fri, Aug 5, 2016 at 2:08 AM Reynold Xin  wrote:
>>
>>> Sounds like a great idea!
>>>
>>> On Friday, August 5, 2016, Nicholas Chammas 
>>> wrote:
>>>
 Context managers
 
 are a natural way to capture closely related setup and teardown code in
 Python.

 For example, they are commonly used when doing file I/O:

 with open('/path/to/file') as f:
 contents = f.read()
 ...

 Once the program exits the with block, f is automatically closed.

 Does it make sense to apply this pattern to persisting and unpersisting
 DataFrames and RDDs? I feel like there are many cases when you want to
 persist a DataFrame for a specific set of operations and then unpersist it
 immediately afterwards.

 For example, take model training. Today, you might do something like
 this:

 labeled_data.persist()
 model = pipeline.fit(labeled_data)
 labeled_data.unpersist()

 If persist() returned a context manager, you could rewrite this as
 follows:

 with labeled_data.persist():
 model = pipeline.fit(labeled_data)

 Upon exiting the with block, labeled_data would automatically be
 unpersisted.

 This can be done in a backwards-compatible way since persist() would
 still return the parent DataFrame or RDD as it does today, but add two
 methods to the object: __enter__() and __exit__()

 Does this make sense? Is it attractive?

 Nick
 ​

>>>
>>


Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Micah Kornfield
Hi Everyone,
I'm an Arrow contributor mostly on the C++ side of things, but I'll try to
give a brief update of where I believe the project currently is (the views
are my own, but hopefully are fairly accurate :).

I think in the long run the diagram mentioned by Jim is where we would like
Arrow to be, but it is clearly not there yet. The ticket referenced [2] was
an actionable first step for using Arrow, not the end state (after I finish
a couple more items in Arrow I was hoping to work on that ticket, but that
might be a little ways out still).

In terms of the specification [1], the memory model specification seems to
be fairly stable but might undergo a couple more tweaks (there is still some
discussion on making a first-class string/binary column type and on how we
address endianness on different systems). The RPC model has a first draft
for how to serialize types to a stream/memory space, but needs to be fleshed
out some more to deal with the practicalities of resource management.

The Java implementation of Arrow was originally taken from the Apache Drill
code base, and I think it is fairly close to conforming to the specification
[1] (if not already doing so). It should be "mature" in the sense that it is
being used in a real system.

The C++/Python code is still in development and I hope to have at least a
prototype showing some form of communication between a JVM and C++/Python
process over the next couple weeks.

The last time I looked at the spark code base, the main difference I saw
with Arrow versus the existing columnar memory structure in Spark was how
null flags and boolean values are stored.  Arrow bit-packs null
flags/boolean values. Spark seems to have one byte per value.  I did not
take a close look to see if the memory allocation APIs were compatible
between Spark and Arrow.
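As a toy illustration of that difference (plain Python, not Arrow or Spark
code):

def bitpacked_nulls(nulls):
    # 1 bit per value, the way a bit-packed validity bitmap works.
    buf = bytearray((len(nulls) + 7) // 8)
    for i, is_null in enumerate(nulls):
        if is_null:
            buf[i // 8] |= 1 << (i % 8)
    return buf

def byte_per_value_nulls(nulls):
    # 1 byte per value, as Spark's columnar representation appears to do.
    return bytearray(1 if is_null else 0 for is_null in nulls)

nulls = [i % 10 == 0 for i in range(1000000)]
print(len(bitpacked_nulls(nulls)))       # 125000 bytes
print(len(byte_per_value_nulls(nulls)))  # 1000000 bytes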

If people are interested, I for one would like to hear feedback on the
current specifications/code for Arrow, so please feel free to chime in on
the Arrow dev mailing list.

Thanks,
Micah

[1] https://github.com/apache/arrow/tree/master/format
[2] https://issues.apache.org/jira/browse/SPARK-13534

On Fri, Aug 5, 2016 at 4:07 PM, Nicholas Chammas  wrote:

> Don't know much about Spark + Arrow efforts myself; just wanted to share
> the reference.
>
> On Fri, Aug 5, 2016 at 6:53 PM Jim Pivarski  wrote:
>
>> On Fri, Aug 5, 2016 at 5:14 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Relevant jira: https://issues.apache.org/jira/browse/SPARK-13534
>>>
>>
>> Thank you. This ticket describes output from Spark to Arrow for flat
>> (non-nested) tables. Are there no plans to input from Arrow to Spark for
>> general types? Did I misunderstand the blogs?
>>
>> I don't see it in this search: https://issues.apache.org/jira/browse/SPARK-13534?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22arrow%22
>>
>> I'm beginning to think there's some misleading information out there
>> (like this diagram: https://arrow.apache.org/img/shared2.png).
>>
>> -- JIm
>>
>>


Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Nicholas Chammas
Don't know much about Spark + Arrow efforts myself; just wanted to share
the reference.

On Fri, Aug 5, 2016 at 6:53 PM Jim Pivarski  wrote:

> On Fri, Aug 5, 2016 at 5:14 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Relevant jira: https://issues.apache.org/jira/browse/SPARK-13534
>>
>
> Thank you. This ticket describes output from Spark to Arrow for flat
> (non-nested) tables. Are there no plans to input from Arrow to Spark for
> general types? Did I misunderstand the blogs?
>
> I don't see it in this search:
> https://issues.apache.org/jira/browse/SPARK-13534?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22arrow%22
>
> I'm beginning to think there's some misleading information out there (like
> this diagram: https://arrow.apache.org/img/shared2.png).
>
> -- JIm
>
>


Re: PySpark: Make persist() return a context manager

2016-08-05 Thread Nicholas Chammas
Good point.

Do you think it's sufficient to note this somewhere in the documentation
(or simply assume that user understanding of transformations vs. actions
means they know this), or are there other implications that need to be
considered?

On Fri, Aug 5, 2016 at 6:50 PM Koert Kuipers  wrote:

> The tricky part is that the action needs to be inside the with block, not
> just the transformation that uses the persisted data.
>
> On Aug 5, 2016 1:44 PM, "Nicholas Chammas" 
> wrote:
>
> Okie doke, I've filed a JIRA for this here:
> https://issues.apache.org/jira/browse/SPARK-16921
>
> On Fri, Aug 5, 2016 at 2:08 AM Reynold Xin  wrote:
>
>> Sounds like a great idea!
>>
>> On Friday, August 5, 2016, Nicholas Chammas 
>> wrote:
>>
>>> Context managers
>>> 
>>> are a natural way to capture closely related setup and teardown code in
>>> Python.
>>>
>>> For example, they are commonly used when doing file I/O:
>>>
>>> with open('/path/to/file') as f:
>>> contents = f.read()
>>> ...
>>>
>>> Once the program exits the with block, f is automatically closed.
>>>
>>> Does it make sense to apply this pattern to persisting and unpersisting
>>> DataFrames and RDDs? I feel like there are many cases when you want to
>>> persist a DataFrame for a specific set of operations and then unpersist it
>>> immediately afterwards.
>>>
>>> For example, take model training. Today, you might do something like
>>> this:
>>>
>>> labeled_data.persist()
>>> model = pipeline.fit(labeled_data)
>>> labeled_data.unpersist()
>>>
>>> If persist() returned a context manager, you could rewrite this as
>>> follows:
>>>
>>> with labeled_data.persist():
>>> model = pipeline.fit(labeled_data)
>>>
>>> Upon exiting the with block, labeled_data would automatically be
>>> unpersisted.
>>>
>>> This can be done in a backwards-compatible way since persist() would
>>> still return the parent DataFrame or RDD as it does today, but add two
>>> methods to the object: __enter__() and __exit__()
>>>
>>> Does this make sense? Is it attractive?
>>>
>>> Nick
>>> ​
>>>
>>
>


Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Jim Pivarski
On Fri, Aug 5, 2016 at 5:14 PM, Nicholas Chammas  wrote:

> Relevant jira: https://issues.apache.org/jira/browse/SPARK-13534
>

Thank you. This ticket describes output from Spark to Arrow for flat
(non-nested) tables. Are there no plans to input from Arrow to Spark for
general types? Did I misunderstand the blogs?

I don't see it in this search:
https://issues.apache.org/jira/browse/SPARK-13534?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22arrow%22

I'm beginning to think there's some misleading information out there (like
this diagram: https://arrow.apache.org/img/shared2.png).

-- JIm


Re: PySpark: Make persist() return a context manager

2016-08-05 Thread Koert Kuipers
The tricky part is that the action needs to be inside the with block, not
just the transformation that uses the persisted data.

On Aug 5, 2016 1:44 PM, "Nicholas Chammas" 
wrote:

Okie doke, I've filed a JIRA for this here:
https://issues.apache.org/jira/browse/SPARK-16921

On Fri, Aug 5, 2016 at 2:08 AM Reynold Xin  wrote:

> Sounds like a great idea!
>
> On Friday, August 5, 2016, Nicholas Chammas 
> wrote:
>
>> Context managers
>> 
>> are a natural way to capture closely related setup and teardown code in
>> Python.
>>
>> For example, they are commonly used when doing file I/O:
>>
>> with open('/path/to/file') as f:
>> contents = f.read()
>> ...
>>
>> Once the program exits the with block, f is automatically closed.
>>
>> Does it make sense to apply this pattern to persisting and unpersisting
>> DataFrames and RDDs? I feel like there are many cases when you want to
>> persist a DataFrame for a specific set of operations and then unpersist it
>> immediately afterwards.
>>
>> For example, take model training. Today, you might do something like this:
>>
>> labeled_data.persist()
>> model = pipeline.fit(labeled_data)
>> labeled_data.unpersist()
>>
>> If persist() returned a context manager, you could rewrite this as
>> follows:
>>
>> with labeled_data.persist():
>> model = pipeline.fit(labeled_data)
>>
>> Upon exiting the with block, labeled_data would automatically be
>> unpersisted.
>>
>> This can be done in a backwards-compatible way since persist() would
>> still return the parent DataFrame or RDD as it does today, but add two
>> methods to the object: __enter__() and __exit__()
>>
>> Does this make sense? Is it attractive?
>>
>> Nick
>> ​
>>
>


Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Nicholas Chammas
Relevant jira: https://issues.apache.org/jira/browse/SPARK-13534
On Fri, Aug 5, 2016 at 5:22 PM, Holden Karau wrote:

> I don't think there is an approximate timescale right now and its likely
> any implementation would depend on a solid Java implementation of Arrow
> being ready first (or even a guarantee that it necessarily will - although
> I'm interested in making it happen in some places where it makes sense).
>
> On Fri, Aug 5, 2016 at 2:18 PM, Jim Pivarski  wrote:
>
>> I see. I've already started working with Arrow-C++ and talking to members
>> of the Arrow community, so I'll keep doing that.
>>
>> As a follow-up question, is there an approximate timescale for when Spark
>> will support Arrow? I'd just like to know that all the pieces will come
>> together eventually.
>>
>> (In this forum, most of the discussion about Arrow is about PySpark and
>> Pandas, not Spark in general.)
>>
>> Best,
>> Jim
>>
>> On Aug 5, 2016 2:43 PM, "Holden Karau"  wrote:
>>
>>> Spark does not currently support Apache Arrow - probably a good place to
>>> chat would be on the Arrow mailing list where they are making progress
>>> towards unified JVM & Python/R support which is sort of a precondition of a
>>> functioning Arrow interface between Spark and Python.
>>>
>>> On Fri, Aug 5, 2016 at 12:40 PM, jpivar...@gmail.com <
>>> jpivar...@gmail.com> wrote:
>>>
 In a few earlier posts [ 1
 <
 http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-td13898.html
 >
 ] [ 2
 <
 http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-access-the-off-heap-representation-of-cached-data-in-Spark-2-0-td17701.html
 >
 ], I asked about moving data from C++ into a Spark data source (RDD,
 DataFrame, or Dataset). The issue is that even the off-heap cache might
 not
 have a stable representation: it might change from one version to the
 next.

 I recently learned about Apache Arrow, a data layer that Spark
 currently or
 will someday share with Pandas, Impala, etc. Suppose that I can fill a
 buffer (such as a direct ByteBuffer) with Arrow-formatted data, is
 there an
 easy--- or even zero-copy--- way to use that in Spark? Is that an API
 that
 could be developed?

 I'll be at the KDD Spark 2.0 tutorial on August 15. Is that a good
 place to
 ask this question?

 Thanks,
 -- Jim




 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Arrow-data-in-buffer-to-RDD-DataFrame-Dataset-tp18563.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Holden Karau
I don't think there is an approximate timescale right now, and it's likely
any implementation would depend on a solid Java implementation of Arrow
being ready first (or even a guarantee that it necessarily will be, although
I'm interested in making it happen in some places where it makes sense).

On Fri, Aug 5, 2016 at 2:18 PM, Jim Pivarski  wrote:

> I see. I've already started working with Arrow-C++ and talking to members
> of the Arrow community, so I'll keep doing that.
>
> As a follow-up question, is there an approximate timescale for when Spark
> will support Arrow? I'd just like to know that all the pieces will come
> together eventually.
>
> (In this forum, most of the discussion about Arrow is about PySpark and
> Pandas, not Spark in general.)
>
> Best,
> Jim
>
> On Aug 5, 2016 2:43 PM, "Holden Karau"  wrote:
>
>> Spark does not currently support Apache Arrow - probably a good place to
>> chat would be on the Arrow mailing list where they are making progress
>> towards unified JVM & Python/R support which is sort of a precondition of a
>> functioning Arrow interface between Spark and Python.
>>
>> On Fri, Aug 5, 2016 at 12:40 PM, jpivar...@gmail.com > > wrote:
>>
>>> In a few earlier posts [ 1
>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-td13898.html>
>>> ] [ 2
>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-access-the-off-heap-representation-of-cached-data-in-Spark-2-0-td17701.html>
>>> ], I asked about moving data from C++ into a Spark data source (RDD,
>>> DataFrame, or Dataset). The issue is that even the off-heap cache might
>>> not
>>> have a stable representation: it might change from one version to the
>>> next.
>>>
>>> I recently learned about Apache Arrow, a data layer that Spark currently
>>> or
>>> will someday share with Pandas, Impala, etc. Suppose that I can fill a
>>> buffer (such as a direct ByteBuffer) with Arrow-formatted data, is there
>>> an
>>> easy--- or even zero-copy--- way to use that in Spark? Is that an API
>>> that
>>> could be developed?
>>>
>>> I'll be at the KDD Spark 2.0 tutorial on August 15. Is that a good place
>>> to
>>> ask this question?
>>>
>>> Thanks,
>>> -- Jim
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Arrow-data-in-buffer-to-RDD-DataFrame-Dataset-tp18563.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Jim Pivarski
I see. I've already started working with Arrow-C++ and talking to members
of the Arrow community, so I'll keep doing that.

As a follow-up question, is there an approximate timescale for when Spark
will support Arrow? I'd just like to know that all the pieces will come
together eventually.

(In this forum, most of the discussion about Arrow is about PySpark and
Pandas, not Spark in general.)

Best,
Jim

On Aug 5, 2016 2:43 PM, "Holden Karau"  wrote:

> Spark does not currently support Apache Arrow - probably a good place to
> chat would be on the Arrow mailing list where they are making progress
> towards unified JVM & Python/R support which is sort of a precondition of a
> functioning Arrow interface between Spark and Python.
>
> On Fri, Aug 5, 2016 at 12:40 PM, jpivar...@gmail.com 
> wrote:
>
>> In a few earlier posts [ 1
>> <http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-td13898.html>
>> ] [ 2
>> <http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-access-the-off-heap-representation-of-cached-data-in-Spark-2-0-td17701.html>
>> ], I asked about moving data from C++ into a Spark data source (RDD,
>> DataFrame, or Dataset). The issue is that even the off-heap cache might
>> not
>> have a stable representation: it might change from one version to the
>> next.
>>
>> I recently learned about Apache Arrow, a data layer that Spark currently
>> or
>> will someday share with Pandas, Impala, etc. Suppose that I can fill a
>> buffer (such as a direct ByteBuffer) with Arrow-formatted data, is there
>> an
>> easy--- or even zero-copy--- way to use that in Spark? Is that an API that
>> could be developed?
>>
>> I'll be at the KDD Spark 2.0 tutorial on August 15. Is that a good place
>> to
>> ask this question?
>>
>> Thanks,
>> -- Jim
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Arrow-data-in-buffer-to-RDD-DataFrame-Dataset-tp18563.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Jeremy Smith
If you had a persistent, off-heap buffer of Arrow data on each executor,
and you could get an iterator over that buffer from inside of a task, then
you could conceivably define an RDD over it by just extending RDD and
returning the iterator from the compute method.  If you want to make a
Dataset or DataFrame, though, it's going to be tough to avoid copying the
data.  You can't avoid Spark copying data into InternalRows unless your RDD
is an RDD[InternalRow] and you create a BaseRelation for it that specifies
needsConversion = false.  It might be possible to implement InternalRow
over your Arrow buffer, but I'm still fuzzy on whether or not that would
prevent copying/marshaling of the data.  Maybe one of the actual
contributors on Spark SQL will chime in with deeper knowledge.

Jeremy

On Fri, Aug 5, 2016 at 12:43 PM, Holden Karau  wrote:

> Spark does not currently support Apache Arrow - probably a good place to
> chat would be on the Arrow mailing list where they are making progress
> towards unified JVM & Python/R support which is sort of a precondition of a
> functioning Arrow interface between Spark and Python.
>
> On Fri, Aug 5, 2016 at 12:40 PM, jpivar...@gmail.com 
> wrote:
>
>> In a few earlier posts [ 1
>> <
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-td13898.html
>> >
>> ] [ 2
>> <
>> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-access-the-off-heap-representation-of-cached-data-in-Spark-2-0-td17701.html
>> >
>> ], I asked about moving data from C++ into a Spark data source (RDD,
>> DataFrame, or Dataset). The issue is that even the off-heap cache might
>> not
>> have a stable representation: it might change from one version to the
>> next.
>>
>> I recently learned about Apache Arrow, a data layer that Spark currently
>> or
>> will someday share with Pandas, Impala, etc. Suppose that I can fill a
>> buffer (such as a direct ByteBuffer) with Arrow-formatted data, is there
>> an
>> easy--- or even zero-copy--- way to use that in Spark? Is that an API that
>> could be developed?
>>
>> I'll be at the KDD Spark 2.0 tutorial on August 15. Is that a good place
>> to
>> ask this question?
>>
>> Thanks,
>> -- Jim
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Arrow-data-in-buffer-to-RDD-DataFrame-Dataset-tp18563.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Holden Karau
Spark does not currently support Apache Arrow - probably a good place to
chat would be on the Arrow mailing list where they are making progress
towards unified JVM & Python/R support which is sort of a precondition of a
functioning Arrow interface between Spark and Python.

On Fri, Aug 5, 2016 at 12:40 PM, jpivar...@gmail.com 
wrote:

> In a few earlier posts [ 1
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-td13898.html>
> ] [ 2
> <http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-access-the-off-heap-representation-of-cached-data-in-Spark-2-0-td17701.html>
> ], I asked about moving data from C++ into a Spark data source (RDD,
> DataFrame, or Dataset). The issue is that even the off-heap cache might not
> have a stable representation: it might change from one version to the next.
>
> I recently learned about Apache Arrow, a data layer that Spark currently or
> will someday share with Pandas, Impala, etc. Suppose that I can fill a
> buffer (such as a direct ByteBuffer) with Arrow-formatted data, is there an
> easy--- or even zero-copy--- way to use that in Spark? Is that an API that
> could be developed?
>
> I'll be at the KDD Spark 2.0 tutorial on August 15. Is that a good place to
> ask this question?
>
> Thanks,
> -- Jim
>
>
>
>
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Arrow-data-in-buffer-to-RDD-DataFrame-Dataset-tp18563.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread jpivar...@gmail.com
In a few earlier posts [ 1
<http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-td13898.html>
] [ 2
<http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-access-the-off-heap-representation-of-cached-data-in-Spark-2-0-td17701.html>
], I asked about moving data from C++ into a Spark data source (RDD,
DataFrame, or Dataset). The issue is that even the off-heap cache might not
have a stable representation: it might change from one version to the next.

I recently learned about Apache Arrow, a data layer that Spark currently or
will someday share with Pandas, Impala, etc. Suppose that I can fill a
buffer (such as a direct ByteBuffer) with Arrow-formatted data, is there an
easy--- or even zero-copy--- way to use that in Spark? Is that an API that
could be developed?

I'll be at the KDD Spark 2.0 tutorial on August 15. Is that a good place to
ask this question?

Thanks,
-- Jim




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Arrow-data-in-buffer-to-RDD-DataFrame-Dataset-tp18563.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: PySpark: Make persist() return a context manager

2016-08-05 Thread Nicholas Chammas
Okie doke, I've filed a JIRA for this here:
https://issues.apache.org/jira/browse/SPARK-16921

On Fri, Aug 5, 2016 at 2:08 AM Reynold Xin  wrote:

> Sounds like a great idea!
>
> On Friday, August 5, 2016, Nicholas Chammas 
> wrote:
>
>> Context managers
>> 
>> are a natural way to capture closely related setup and teardown code in
>> Python.
>>
>> For example, they are commonly used when doing file I/O:
>>
>> with open('/path/to/file') as f:
>> contents = f.read()
>> ...
>>
>> Once the program exits the with block, f is automatically closed.
>>
>> Does it make sense to apply this pattern to persisting and unpersisting
>> DataFrames and RDDs? I feel like there are many cases when you want to
>> persist a DataFrame for a specific set of operations and then unpersist it
>> immediately afterwards.
>>
>> For example, take model training. Today, you might do something like this:
>>
>> labeled_data.persist()
>> model = pipeline.fit(labeled_data)
>> labeled_data.unpersist()
>>
>> If persist() returned a context manager, you could rewrite this as
>> follows:
>>
>> with labeled_data.persist():
>> model = pipeline.fit(labeled_data)
>>
>> Upon exiting the with block, labeled_data would automatically be
>> unpersisted.
>>
>> This can be done in a backwards-compatible way since persist() would
>> still return the parent DataFrame or RDD as it does today, but add two
>> methods to the object: __enter__() and __exit__()
>>
>> Does this make sense? Is it attractive?
>>
>> Nick
>> ​
>>
>


Re: Result code of whole stage codegen

2016-08-05 Thread Maciej Bryński
Thank you.
That was it.

Regards,
Maciek

2016-08-05 10:06 GMT+02:00 Herman van Hövell tot Westerflier <
hvanhov...@databricks.com>:

> Do you want to see the code that whole stage codegen produces?
>
> You can prepend a SQL statement with EXPLAIN CODEGEN ...
>
> Or you can add the following code to a DataFrame/Dataset command:
>
> import org.apache.spark.sql.execution.debug._
>
> and call the debugCodegen() command on a DataFrame/Dataset, for
> example:
>
> range(0, 100).debugCodegen
>
> ...
>
> Found 1 WholeStageCodegen subtrees.
>
> == Subtree 1 / 1 ==
>
> *Range (0, 100, splits=8)
>
>
> Generated code:
>
> /* 001 */ public Object generate(Object[] references) {
>
> /* 002 */   return new GeneratedIterator(references);
>
> /* 003 */ }
>
> /* 004 */
>
> /* 005 */ final class GeneratedIterator extends 
> org.apache.spark.sql.execution.BufferedRowIterator
> {
>
> /* 006 */   private Object[] references;
>
> /* 007 */   private org.apache.spark.sql.execution.metric.SQLMetric
> range_numOutputRows;
>
> /* 008 */   private boolean range_initRange;
>
> /* 009 */   private long range_partitionEnd;
>
> ...
>
> On Fri, Aug 5, 2016 at 9:55 AM, Maciej Bryński  wrote:
>
>> Hi,
>> I have some operation on DataFrame / Dataset.
>> How can I see source code for whole stage codegen ?
>> Is there any API for this ? Or maybe I should configure log4j in specific
>> way ?
>>
>> Regards,
>> --
>> Maciek Bryński
>>
>
>


-- 
Maciek Bryński


Re: Spark SQL and Kryo registration

2016-08-05 Thread Maciej Bryński
Hi Olivier,
Did you check the performance of Kryo?
I have observed that Kryo is slightly slower than the Java serializer.
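For reference, the settings under discussion would be set roughly like this
from PySpark (a sketch; the config keys are the documented ones, the class to
register is just a placeholder):

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrationRequired", "true")
        # every serialized class must be registered when registrationRequired=true
        .set("spark.kryo.classesToRegister", "org.example.MyCaseClass"))

spark = SparkSession.builder.config(conf=conf).getOrCreate()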

Regards,
Maciek

2016-08-04 17:41 GMT+02:00 Amit Sela :

> It should. Codegen uses the SparkConf in SparkEnv when instantiating a new
> Serializer.
>
> On Thu, Aug 4, 2016 at 6:14 PM Jacek Laskowski  wrote:
>
>> Hi Olivier,
>>
>> I don't know either, but am curious what you've tried already.
>>
>> Jacek
>>
>> On 3 Aug 2016 10:50 a.m., "Olivier Girardot" <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>>> Hi everyone,
>>> I'm currently to use Spark 2.0.0 and making Dataframes work with kryo.
>>> registrationRequired=true
>>> Is it even possible at all considering the codegen ?
>>>
>>> Regards,
>>>
>>> *Olivier Girardot* | Associé
>>> o.girar...@lateral-thoughts.com
>>> +33 6 24 09 17 94
>>>
>>


-- 
Maciek Bryński


Re: Result code of whole stage codegen

2016-08-05 Thread Herman van Hövell tot Westerflier
Do you want to see the code that whole stage codegen produces?

You can prepend a SQL statement with EXPLAIN CODEGEN ...

Or you can add the following code to a DataFrame/Dataset command:

import org.apache.spark.sql.execution.debug._

and call the debugCodegen() command on a DataFrame/Dataset, for example:

range(0, 100).debugCodegen

...

Found 1 WholeStageCodegen subtrees.

== Subtree 1 / 1 ==

*Range (0, 100, splits=8)


Generated code:

/* 001 */ public Object generate(Object[] references) {

/* 002 */   return new GeneratedIterator(references);

/* 003 */ }

/* 004 */

/* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {

/* 006 */   private Object[] references;

/* 007 */   private org.apache.spark.sql.execution.metric.SQLMetric range_numOutputRows;

/* 008 */   private boolean range_initRange;

/* 009 */   private long range_partitionEnd;

...
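The EXPLAIN CODEGEN route should also work from PySpark (a sketch; assumes an
existing SparkSession named spark):

plan = spark.sql("EXPLAIN CODEGEN SELECT id * 2 AS doubled FROM range(100)")
for row in plan.collect():
    print(row[0])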

On Fri, Aug 5, 2016 at 9:55 AM, Maciej Bryński  wrote:

> Hi,
> I have some operation on DataFrame / Dataset.
> How can I see source code for whole stage codegen ?
> Is there any API for this ? Or maybe I should configure log4j in specific
> way ?
>
> Regards,
> --
> Maciek Bryński
>


Result code of whole stage codegen

2016-08-05 Thread Maciej Bryński
Hi,
I have some operation on DataFrame / Dataset.
How can I see source code for whole stage codegen ?
Is there any API for this ? Or maybe I should configure log4j in specific
way ?

Regards,
-- 
Maciek Bryński