Unsubscribe

2018-08-20 Thread Happy??????


Re: [DISCUSS] USING syntax for Datasource V2

2018-08-20 Thread Russell Spitzer
I'm not sure I follow what the discussion topic is here

> For example, a Cassandra catalog or a JDBC catalog that exposes tables in
those systems will definitely not support users marking tables with the
“parquet” data source.

I don't understand why a Cassandra Catalogue wouldn't be able to store
metadata references for a parquet table just as a Hive Catalogue can store
references to a C* datasource. We currently store HiveMetastore data in a
C* table, and this allows us to store tables with any underlying format even
though the catalogue's implementation is backed by C*.

Is the idea here that a table can't have multiple underlying formats in a
given catalogue? And the USING can then be used on read to force a
particular format?

> I think other catalogs should be able to choose what to do with the USING
 string

This makes sense to me, but I'm not sure why any catalogue would want to
ignore this?

It would be helpful to me to have a few examples written out, if possible,
showing the Old Implementation and the New Implementation.

Thanks for your time,
Russ

On Mon, Aug 20, 2018 at 11:33 AM Ryan Blue 
wrote:

> Thanks for posting this discussion to the dev list, it would be great to
> hear what everyone thinks about the idea that USING should be a
> catalog-specific storage configuration.
>
> Related to this, I’ve updated the catalog PR, #21306
> , to include an
> implementation that translates from the v2 TableCatalog API
> 
> to the current catalog API. That shows how this would fit together with v1,
> at least for the catalog part. This will enable all of the new query plans
> to be written to the TableCatalog API, even if they end up using the
> default v1 catalog.
>
> On Mon, Aug 20, 2018 at 12:19 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I have been trying to follow up on `USING` syntax support, since it looks like
>> it is currently not supported whereas the `format` API supports it. I have been
>> trying to understand why and talked with Ryan.
>>
>> Ryan knows all the details, and he and I thought it would be good to post here -
>> I have just started to look into this.
>> Here is Ryan's response:
>>
>>
>> >USING is currently used to select the underlying data source
>> implementation directly. The string passed in USING or format in the DF
>> API is used to resolve an implementation class.
>>
>> The existing catalog supports tables that specify their datasource
>> implementation, but this will not be the case for all catalogs when Spark
>> adds multiple catalog support. For example, a Cassandra catalog or a JDBC
>> catalog that exposes tables in those systems will definitely not support
>> users marking tables with the “parquet” data source. The catalog must have
>> the ability to determine the data source implementation. That’s why I think
>> it is valuable to think of the current ExternalCatalog as one that can
>> track tables with any read/write implementation. Other catalogs can’t and
>> won’t do that.
>>
>> > In the catalog v2 API  I’ve
>> proposed, everything from CREATE TABLE is passed to the catalog. Then
>> the catalog determines what source to use and returns a Table instance
>> that uses some class for its ReadSupport and WriteSupport implementation.
>> An ExternalCatalog exposed through that API would receive the USING or
>> format string as a table property and would return a Table that uses the
>> correct ReadSupport, so tables stored in an ExternalCatalog will work as
>> they do today.
>>
>> > I think other catalogs should be able to choose what to do with the
>> USING string. An Iceberg  catalog
>> might use this to determine the underlying file format, which could be
>> parquet, orc, or avro. Or, a JDBC catalog might use it for the
>> underlying table implementation in the DB. This would make the property
>> more of a storage hint for the catalog, which is going to determine the
>> read/write implementation anyway.
>>
>> > For cases where there is no catalog involved, the current plan is to
>> use the reflection-based approach from v1 with the USING or format string.
>> In v2, that should resolve a ReadSupportProvider, which is used to
>> create a ReadSupport directly from options. I think this is a good
>> approach for backward-compatibility, but it can’t provide the same features
>> as a catalog-based table. Catalogs are how we have decided to build
>> reliable behavior for CTAS and the other standard logical plans
>> .
>> CTAS is a create and then an insert, and a write implementation alone can’t
>> provide that create operation.
>>
>> I was targeting the last case (where there is no catalog involved) in
>> particular. I was thinking that approach is also good since `USING` syntax

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-20 Thread Manu Zhang
Hi Makatun,

For 2, I guess `cache` will break up the logical plan and force it to be
analyzed.
For 3, I have a similar observation here:
https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015.
Each `withColumn` will force the logical plan to be analyzed, which is not
free. There is `RuleExecutor.dumpTimeSpent`, which prints the time spent in
analysis, and turning on DEBUG logging will also give you much more info.
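
For illustration, a rough sketch (assuming an existing SparkSession and a wide
DataFrame `df`) of the per-column pattern versus a single projection, with
`RuleExecutor.dumpTimeSpent` used to see where the time goes:

  import org.apache.spark.sql.catalyst.rules.RuleExecutor
  import org.apache.spark.sql.functions.col

  // Per-column withColumn: every call produces a new DataFrame whose plan is analyzed again.
  val perColumn = df.columns.foldLeft(df)((acc, c) =>
    acc.withColumn(s"${c}_isNotNull", col(c).isNotNull))

  // Single projection: all new columns are added in one select, so the plan is analyzed once.
  val extraCols = df.columns.map(c => col(c).isNotNull.as(s"${c}_isNotNull"))
  val singleSelect = df.select(df.columns.map(col) ++ extraCols: _*)

  // Prints cumulative time spent in analyzer/optimizer rules so far.
  println(RuleExecutor.dumpTimeSpent())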

Thanks,
Manu Zhang

On Mon, Aug 20, 2018 at 10:25 PM antonkulaga  wrote:

> makatun, did you try to test something more complex, like
> dataframe.describe
> or PCA?
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Spark Kafka adapter questions

2018-08-20 Thread Ted Yu
After a brief check, I found KAFKA-5649, where an almost identical error was
reported.

There is also KAFKA-3702 which is related but currently open.

I will dig some more to see what I can find.

Cheers

On Mon, Aug 20, 2018 at 3:53 PM Basil Hariri 
wrote:

> I am pretty sure I got those changes with the jar I compiled (I pulled
> from master on 8/8, and it looks like SPARK-18057 was resolved on 8/3), but
> no luck; here is a copy-paste of the error I’m seeing. The semantics for
> Event Hubs’ Kafka head are highlighted for reference – we connect to port
> 9093 on an FQDN instead of port 9092 on a Kafka broker’s IP address, but I
> don’t think that should change anything.
>
>
>
>
>
> 18/08/20 22:29:13 INFO AbstractCoordinator: Discovered coordinator :9093 (id: 2147483647 rack: null) for group
> spark-kafka-source-1aa50598-99d1-4c53-a73c-fa6637a219b2--1338794993-driver-0.
>
> 18/08/20 22:29:13 INFO AbstractCoordinator: (Re-)joining group
> spark-kafka-source-1aa50598-99d1-4c53-a73c-fa6637a219b2--1338794993-driver-0
>
> 18/08/20 22:29:33 WARN SslTransportLayer: Failed to send SSL Close message
>
> java.io.IOException: Broken pipe
>
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>
> at
> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>
> at
> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
>
> at
> kafkashaded.org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:195)
>
> at
> kafkashaded.org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:163)
>
> at
> kafkashaded.org.apache.kafka.common.utils.Utils.closeAll(Utils.java:690)
>
> at
> kafkashaded.org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:47)
>
> at
> kafkashaded.org.apache.kafka.common.network.Selector.close(Selector.java:471)
>
> at
> kafkashaded.org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:348)
>
> at
> kafkashaded.org.apache.kafka.common.network.Selector.poll(Selector.java:283)
>
> at
> kafkashaded.org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:260)
>
> at
> kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
>
> at
> kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
>
> at
> kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
>
> at
> kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:163)
>
> at
> kafkashaded.org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:243)
>
> at
> kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:366)
>
> at
> kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:978)
>
> at
> kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
>
> at
> org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$fetchLatestOffsets$1$$anonfun$apply$9.apply(KafkaOffsetReader.scala:214)
>
> at
> org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$fetchLatestOffsets$1$$anonfun$apply$9.apply(KafkaOffsetReader.scala:212)
>
> at
> org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$org$apache$spark$sql$kafka010$KafkaOffsetReader$$withRetriesWithoutInterrupt$1.apply$mcV$sp(KafkaOffsetReader.scala:303)
>
> at
> org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$org$apache$spark$sql$kafka010$KafkaOffsetReader$$withRetriesWithoutInterrupt$1.apply(KafkaOffsetReader.scala:302)
>
> at
> org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$org$apache$spark$sql$kafka010$KafkaOffsetReader$$withRetriesWithoutInterrupt$1.apply(KafkaOffsetReader.scala:302)
>
> at
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>
> at org.apache.spark.sql.kafka010.KafkaOffsetReader.org
> $apache$spark$sql$kafka010$KafkaOffsetReader$$withRetriesWithoutInterrupt(KafkaOffsetReader.scala:301)
>
> at
> org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$fetchLatestOffsets$1.apply(KafkaOffsetReader.scala:212)
>
> at
> org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$fetchLatestOffsets$1.apply(KafkaOffsetReader.scala:212)
>
>  

RE: Spark Kafka adapter questions

2018-08-20 Thread Basil Hariri
I am pretty sure I got those changes with the jar I compiled (I pulled from
master on 8/8, and it looks like SPARK-18057 was resolved on 8/3), but no luck;
here is a copy-paste of the error I’m seeing. The semantics for Event Hubs’
Kafka head are highlighted for reference – we connect to port 9093 on an FQDN
instead of port 9092 on a Kafka broker’s IP address, but I don’t think that
should change anything.


18/08/20 22:29:13 INFO AbstractCoordinator: Discovered coordinator :9093 (id: 2147483647 rack: null) for group
spark-kafka-source-1aa50598-99d1-4c53-a73c-fa6637a219b2--1338794993-driver-0.
18/08/20 22:29:13 INFO AbstractCoordinator: (Re-)joining group 
spark-kafka-source-1aa50598-99d1-4c53-a73c-fa6637a219b2--1338794993-driver-0
18/08/20 22:29:33 WARN SslTransportLayer: Failed to send SSL Close message
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at 
sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at 
kafkashaded.org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:195)
at 
kafkashaded.org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:163)
at 
kafkashaded.org.apache.kafka.common.utils.Utils.closeAll(Utils.java:690)
at 
kafkashaded.org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:47)
at 
kafkashaded.org.apache.kafka.common.network.Selector.close(Selector.java:471)
at 
kafkashaded.org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:348)
at 
kafkashaded.org.apache.kafka.common.network.Selector.poll(Selector.java:283)
at 
kafkashaded.org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:260)
at 
kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
at 
kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
at 
kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
at 
kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:163)
at 
kafkashaded.org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:243)
at 
kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:366)
at 
kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:978)
at 
kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
at 
org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$fetchLatestOffsets$1$$anonfun$apply$9.apply(KafkaOffsetReader.scala:214)
at 
org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$fetchLatestOffsets$1$$anonfun$apply$9.apply(KafkaOffsetReader.scala:212)
at 
org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$org$apache$spark$sql$kafka010$KafkaOffsetReader$$withRetriesWithoutInterrupt$1.apply$mcV$sp(KafkaOffsetReader.scala:303)
at 
org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$org$apache$spark$sql$kafka010$KafkaOffsetReader$$withRetriesWithoutInterrupt$1.apply(KafkaOffsetReader.scala:302)
at 
org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$org$apache$spark$sql$kafka010$KafkaOffsetReader$$withRetriesWithoutInterrupt$1.apply(KafkaOffsetReader.scala:302)
at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
at 
org.apache.spark.sql.kafka010.KafkaOffsetReader.org$apache$spark$sql$kafka010$KafkaOffsetReader$$withRetriesWithoutInterrupt(KafkaOffsetReader.scala:301)
at 
org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$fetchLatestOffsets$1.apply(KafkaOffsetReader.scala:212)
at 
org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$fetchLatestOffsets$1.apply(KafkaOffsetReader.scala:212)
at 
org.apache.spark.sql.kafka010.KafkaOffsetReader.runUninterruptibly(KafkaOffsetReader.scala:270)
at 
org.apache.spark.sql.kafka010.KafkaOffsetReader.fetchLatestOffsets(KafkaOffsetReader.scala:211)
at 
org.apache.spark.sql.kafka010.KafkaMicroBatchReader$$anonfun$getOrCreateInitialPartitionOffsets$1.apply(KafkaMicroBatchReader.scala:212)
at 

Re: [discuss][minor] impending python 3.x jenkins upgrade... 3.5.x? 3.6.x?

2018-08-20 Thread Li Jin
Thanks for looking into this Shane. If we can only have a single python 3
version, I agree 3.6 would be better than 3.5. Otherwise, ideally I think
it would be nice to test all supported 3.x versions (latest micros should
be fine).

On Mon, Aug 20, 2018 at 7:07 PM shane knapp  wrote:

> initially, i'd like to just choose one version to have the primary tests
> against, but i'm also not opposed to supporting more of a matrix.  the
> biggest problem i see w/this approach, however, is that of build monitoring
> and long-term ownership.  this is why we have a relatively restrictive
> current deployment.
>
> another thing i've been noticing during this project is that we have a lot
> of flaky tests...  for instance, i'm literally having every other build
> fail on my (relatively) up-to-date PRB fork:
> https://amplab.cs.berkeley.edu/jenkins/job/ubuntuSparkPRB/
>
> (i'm testing more than python here, otherwise i could just build a spark
> distro and run the python tests against that)
>
> i'll also set up a 3.6 env tomorrow and start testing against that.  i'm
> pretty confident about 3.5, tho.
>
> shane
>
> On Mon, Aug 20, 2018 at 11:33 AM, Bryan Cutler  wrote:
>
>> Thanks for looking into this Shane!  If we are choosing a single python
>> 3.x, I think 3.6 would be good. It might still be nice to test against
>> other versions too, so we can catch any issues. Is it possible to have more
>> exhaustive testing as part of a nightly or scheduled build? As a point of
>> reference for Python 3.6, Arrow is using this version for CI.
>>
>> Bryan
>>
>> On Sun, Aug 19, 2018 at 9:49 PM Hyukjin Kwon  wrote:
>>
>>> Actually Python 3.7 is released (
>>> https://www.python.org/downloads/release/python-370/) too and I fixed
>>> the compatibility issues accordingly -
>>> https://github.com/apache/spark/pull/21714
>>> There has been an issue for 3.6 (comparing to lower versions of Python
>>> including 3.5) - https://github.com/apache/spark/pull/16429
>>>
>>> I am not yet sure what's the best matrix for it actually. In case of R,
>>> we test lowest version in Jenkins and highest version via AppVeyor FWIW.
>>> I don't have a strong opinion on this since we have been
>>> having compatibility issues for each Python version.
>>>
>>>
>>> On Tue, Aug 14, 2018 at 4:15 AM, shane knapp wrote:
>>>
 hey everyone!

 i was checking out the EOL/release cycle for python 3.5 and it looks
 like we'll have 3.5.6 released in early 2019.

 this got me to thinking:  instead of 3.5, what about 3.6?

 i looked around, and according to the 'docs' and 'collective wisdom of
 the internets', 3.5 and 3.6 should be fully backwards-compatible w/3.4.

 of course, this needs to be taken w/a grain of salt, as we're mostly
 focused on actual python package requirements, rather than worrying about
 core python functionality.

 thoughts?  comments?

 thanks in advance,

 shane
 --
 Shane Knapp
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu

>>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [discuss][minor] impending python 3.x jenkins upgrade... 3.5.x? 3.6.x?

2018-08-20 Thread shane knapp
initially, i'd like to just choose one version to have the primary tests
against, but i'm also not opposed to supporting more of a matrix.  the
biggest problem i see w/this approach, however, is that of build monitoring
and long-term ownership.  this is why we have a relatively restrictive
current deployment.

another thing i've been noticing during this project is that we have a lot
of flaky tests...  for instance, i'm literally having every other build
fail on my (relatively) up-to-date PRB fork:
https://amplab.cs.berkeley.edu/jenkins/job/ubuntuSparkPRB/

(i'm testing more than python here, otherwise i could just build a spark
distro and run the python tests against that)

i'll also set up a 3.6 env tomorrow and start testing against that.  i'm
pretty confident about 3.5, tho.

shane

On Mon, Aug 20, 2018 at 11:33 AM, Bryan Cutler  wrote:

> Thanks for looking into this Shane!  If we are choosing a single python
> 3.x, I think 3.6 would be good. It might still be nice to test against
> other versions too, so we can catch any issues. Is it possible to have more
> exhaustive testing as part of a nightly or scheduled build? As a point of
> reference for Python 3.6, Arrow is using this version for CI.
>
> Bryan
>
> On Sun, Aug 19, 2018 at 9:49 PM Hyukjin Kwon  wrote:
>
>> Actually Python 3.7 is released (https://www.python.org/
>> downloads/release/python-370/) too and I fixed the compatibility issues
>> accordingly - https://github.com/apache/spark/pull/21714
>> There has been an issue for 3.6 (comparing to lower versions of Python
>> including 3.5) - https://github.com/apache/spark/pull/16429
>>
>> I am not yet sure what's the best matrix for it actually. In case of R,
>> we test lowest version in Jenkins and highest version via AppVeyor FWIW.
>> I don't have a strong opinion on this since we have been
>> having compatibility issues for each Python version.
>>
>>
>> On Tue, Aug 14, 2018 at 4:15 AM, shane knapp wrote:
>>
>>> hey everyone!
>>>
>>> i was checking out the EOL/release cycle for python 3.5 and it looks
>>> like we'll have 3.5.6 released in early 2019.
>>>
>>> this got me to thinking:  instead of 3.5, what about 3.6?
>>>
>>> i looked around, and according to the 'docs' and 'collective wisdom of
>>> the internets', 3.5 and 3.6 should be fully backwards-compatible w/3.4.
>>>
>>> of course, this needs to be taken w/a grain of salt, as we're mostly
>>> focused on actual python package requirements, rather than worrying about
>>> core python functionality.
>>>
>>> thoughts?  comments?
>>>
>>> thanks in advance,
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [discuss][minor] impending python 3.x jenkins upgrade... 3.5.x? 3.6.x?

2018-08-20 Thread Bryan Cutler
Thanks for looking into this Shane!  If we are choosing a single python
3.x, I think 3.6 would be good. It might still be nice to test against
other versions too, so we can catch any issues. Is it possible to have more
exhaustive testing as part of a nightly or scheduled build? As a point of
reference for Python 3.6, Arrow is using this version for CI.

Bryan

On Sun, Aug 19, 2018 at 9:49 PM Hyukjin Kwon  wrote:

> Actually Python 3.7 is released (
> https://www.python.org/downloads/release/python-370/) too and I fixed the
> compatibility issues accordingly -
> https://github.com/apache/spark/pull/21714
> There has been an issue for 3.6 (comparing to lower versions of Python
> including 3.5) - https://github.com/apache/spark/pull/16429
>
> I am not yet sure what's the best matrix for it actually. In case of R, we
> test lowest version in Jenkins and highest version via AppVeyor FWIW.
> I don't have a strong opinion on this since we have been having
> compatibility issues for each Python version.
>
>
> On Tue, Aug 14, 2018 at 4:15 AM, shane knapp wrote:
>
>> hey everyone!
>>
>> i was checking out the EOL/release cycle for python 3.5 and it looks like
>> we'll have 3.5.6 released in early 2019.
>>
>> this got me to thinking:  instead of 3.5, what about 3.6?
>>
>> i looked around, and according to the 'docs' and 'collective wisdom of
>> the internets', 3.5 and 3.6 should be fully backwards-compatible w/3.4.
>>
>> of course, this needs to be taken w/a grain of salt, as we're mostly
>> focused on actual python package requirements, rather than worrying about
>> core python functionality.
>>
>> thoughts?  comments?
>>
>> thanks in advance,
>>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


Re: best way to run one python test?

2018-08-20 Thread Hyukjin Kwon
In my experience it's usually okay, but it's still an informal way, as far
as I can tell.

On Mon, 20 Aug 2018, 11:24 pm Imran Rashid,  wrote:

> thanks, that helps!
>
> So run-tests.py adds a bunch more env variables:
> https://github.com/apache/spark/blob/master/python/run-tests.py#L74-L97
>
> those don't matter in most cases I guess?
>
> On Sun, Aug 19, 2018 at 11:54 PM Hyukjin Kwon  wrote:
>
>> There's an informal way to test specific tests. For instance:
>>
>> SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests
>>
>> I have a partial fix for our testing script to support this in my local
>> environment, but I haven't had enough time to make a PR for it yet.
>>
>>
>> On Mon, Aug 20, 2018 at 11:08 AM, Imran Rashid wrote:
>>
>>> Hi,
>>>
>>> I haven't spent a lot of time working on the python side of spark before
>>> so apologize if this is a basic question, but I'm trying to figure out the
>>> best way to run a small subset of python tests in a tight loop while
>>> developing.  The closer I can get to sbt's "~test-only *FooSuite -- -z
>>> test-blah" the better.
>>>
>>> I'm familiar with the "--modules" in python/run-tests, but even running
>>> one module takes a long time when I want to just run one teeny test
>>> repeatedly.  Is there a way to run just one file?  And a way to run only
>>> one test within a file?
>>>
>>> So far, I know I can assemble my own command line like run-tests does,
>>> with all the env vars like PYSPARK_SUBMIT_ARGS etc. and just pass in one
>>> test file.  Seems tedious.  Would it be helpful to add a "--single-test"
>>> option (or something) to run-tests.py?
>>>
>>> And for running one test within a file, I know for the unit test files
>>> (like tests.py), I could modify the "main" section to have it run just one
>>> test, but would be nice to be able to do that from the command line.
>>> (maybe there is something similar for doctests, not sure.)  Again, could
>>> add a command line option to run-tests for that, though would be more work
>>> to plumb it through to each suite.
>>>
>>> thanks,
>>> Imran
>>>
>>


Re: [DISCUSS] USING syntax for Datasource V2

2018-08-20 Thread Ryan Blue
Thanks for posting this discussion to the dev list, it would be great to
hear what everyone thinks about the idea that USING should be a
catalog-specific storage configuration.

Related to this, I’ve updated the catalog PR, #21306
, to include an implementation
that translates from the v2 TableCatalog API

to the current catalog API. That shows how this would fit together with v1,
at least for the catalog part. This will enable all of the new query plans
to be written to the TableCatalog API, even if they end up using the
default v1 catalog.

On Mon, Aug 20, 2018 at 12:19 AM Hyukjin Kwon  wrote:

> Hi all,
>
> I have been trying to follow up on `USING` syntax support, since it looks like
> it is currently not supported whereas the `format` API supports it. I have been
> trying to understand why and talked with Ryan.
>
> Ryan knows all the details, and he and I thought it would be good to post here -
> I have just started to look into this.
> Here is Ryan's response:
>
>
> >USING is currently used to select the underlying data source
> implementation directly. The string passed in USING or format in the DF
> API is used to resolve an implementation class.
>
> The existing catalog supports tables that specify their datasource
> implementation, but this will not be the case for all catalogs when Spark
> adds multiple catalog support. For example, a Cassandra catalog or a JDBC
> catalog that exposes tables in those systems will definitely not support
> users marking tables with the “parquet” data source. The catalog must have
> the ability to determine the data source implementation. That’s why I think
> it is valuable to think of the current ExternalCatalog as one that can
> track tables with any read/write implementation. Other catalogs can’t and
> won’t do that.
>
> > In the catalog v2 API  I’ve
> proposed, everything from CREATE TABLE is passed to the catalog. Then the
> catalog determines what source to use and returns a Table instance that
> uses some class for its ReadSupport and WriteSupport implementation. An
> ExternalCatalog exposed through that API would receive the USING or format 
> string
> as a table property and would return a Table that uses the correct
> ReadSupport, so tables stored in an ExternalCatalog will work as they do
> today.
>
> > I think other catalogs should be able to choose what to do with the
> USING string. An Iceberg  catalog
> might use this to determine the underlying file format, which could be
> parquet, orc, or avro. Or, a JDBC catalog might use it for the underlying
> table implementation in the DB. This would make the property more of a
> storage hint for the catalog, which is going to determine the read/write
> implementation anyway.
>
> > For cases where there is no catalog involved, the current plan is to use
> the reflection-based approach from v1 with the USING or format string. In
> v2, that should resolve a ReadSupportProvider, which is used to create a
> ReadSupport directly from options. I think this is a good approach for
> backward-compatibility, but it can’t provide the same features as a
> catalog-based table. Catalogs are how we have decided to build reliable
> behavior for CTAS and the other standard logical plans
> .
> CTAS is a create and then an insert, and a write implementation alone can’t
> provide that create operation.
>
> I was targeting the last case (where there is no catalog involved) in
> particular. I was thinking that approach is also good since `USING` syntax
> compatibility should be kept anyway - this should reduce migration cost as
> well. I was wondering what you guys think about this.
> If you guys think the last case should be supported anyway, I was thinking
> we could just proceed with it orthogonally. If you guys think other issues
> should be resolved first, I think we (at least I will) should take a look
> at the set of catalog APIs.
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: best way to run one python test?

2018-08-20 Thread Imran Rashid
thanks, that helps!

So run-tests.py adds a bunch more env variables:
https://github.com/apache/spark/blob/master/python/run-tests.py#L74-L97

those don't matter in most cases I guess?

On Sun, Aug 19, 2018 at 11:54 PM Hyukjin Kwon  wrote:

> There's an informal way to test specific tests. For instance:
>
> SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests
>
> I have a partial fix for our testing script to support this in my local
> environment, but I haven't had enough time to make a PR for it yet.
>
>
> On Mon, Aug 20, 2018 at 11:08 AM, Imran Rashid wrote:
>
>> Hi,
>>
>> I haven't spent a lot of time working on the python side of spark before
>> so apologize if this is a basic question, but I'm trying to figure out the
>> best way to run a small subset of python tests in a tight loop while
>> developing.  The closer I can get to sbt's "~test-only *FooSuite -- -z
>> test-blah" the better.
>>
>> I'm familiar with the "--modules" in python/run-tests, but even running
>> one module takes a long time when I want to just run one teeny test
>> repeatedly.  Is there a way to run just one file?  And a way to run only
>> one test within a file?
>>
>> So far, I know I can assemble my own command line like run-tests does,
>> with all the env vars like PYSPARK_SUBMIT_ARGS etc. and just pass in one
>> test file.  Seems tedious.  Would it be helpful to add a "--single-test"
>> option (or something) to run-tests.py?
>>
>> And for running one test within a file, I know for the unit test files
>> (like tests.py), I could modify the "main" section to have it run just one
>> test, but would be nice to be able to do that from the command line.
>> (maybe there is something similar for doctests, not sure.)  Again, could
>> add a command line option to run-tests for that, though would be more work
>> to plumb it through to each suite.
>>
>> thanks,
>> Imran
>>
>


Re: Why repartitionAndSortWithinPartitions slower than MapReducer

2018-08-20 Thread 周浥尘
In addition to my previous email,
Environment: spark 2.1.2, hadoop 2.6.0-cdh5.11, Java 1.8, CentOS 6.6

周浥尘 wrote on Mon, Aug 20, 2018 at 8:52 PM:

> Hi team,
>
> I found that the Spark method *repartitionAndSortWithinPartitions* spends
> twice as much time as using MapReduce in some cases.
> I want to repartition the dataset according to split keys and save them
> to files in ascending order. As the doc says,
> repartitionAndSortWithinPartitions “is more efficient than calling
> `repartition` and then sorting within each partition because it can push
> the sorting down into the shuffle machinery.” I thought it might be faster
> than MR, but actually it is much slower. I also adjusted several Spark
> configurations, but that didn't help. (Both Spark and MapReduce
> run on a three-node cluster and share the same number of partitions.)
> Can this situation be explained, or is there any approach to improve the
> performance of Spark?
>
> Thanks & Regards,
> Yichen
>


Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-20 Thread antonkulaga
makatun, did you try to test something more complex, like dataframe.describe
or PCA?



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Why repartitionAndSortWithinPartitions slower than MapReducer

2018-08-20 Thread 周浥尘
Hi team,

I found that the Spark method *repartitionAndSortWithinPartitions* spends twice
as much time as using MapReduce in some cases.
I want to repartition the dataset according to split keys and save them to
files in ascending order. As the doc says, repartitionAndSortWithinPartitions “is
more efficient than calling `repartition` and then sorting within each
partition because it can push the sorting down into the shuffle machinery.”
I thought it might be faster than MR, but actually it is much slower. I
also adjusted several Spark configurations, but that didn't help. (Both
Spark and MapReduce run on a three-node cluster and share the same number
of partitions.)
Can this situation be explained, or is there any approach to improve the
performance of Spark?
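
For reference, a minimal sketch of the pattern in question (made-up data, key
space, partition count, and output path; `sc` is an existing SparkContext):

  import org.apache.spark.HashPartitioner

  // Pair RDD keyed by the split key; keys are sorted within each partition
  // during the shuffle itself, then each partition is written out as one file.
  val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i))
  pairs
    .repartitionAndSortWithinPartitions(new HashPartitioner(8))
    .map { case (k, v) => s"$k\t$v" }
    .saveAsTextFile("/tmp/sorted-by-key")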

Thanks & Regards,
Yichen


Apache Airflow (incubator) PMC binding vote needed

2018-08-20 Thread t4
Hi

Can any member of Apache Incubator PMC provide a vote for Apache Airflow
1.10 to be released?

Thanks

https://lists.apache.org/thread.html/fb09a91f1cef4a63df4d5474e2189248aa65a609a6237d8eefcd8eb7@%3Cdev.airflow.apache.org%3E



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Unsubscribe

2018-08-20 Thread Michael Styles



Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-20 Thread makatun
Hi Marco,
many thanks for pointing out the related Spark commit. According to the
description, it introduces indexed (instead of linear) search over columns
in LogicalPlan.resolve(...).
We have performed the tests on the current Spark master branch and would
like to share the results. There is some good news (see 1) and some
not-so-good news (see 2 and 3).

RESULTS OF TESTING CURRENT MASTER:

1. First of all, the performance of the test jobs described in the initial
post has improved dramatically. In the new version the duration is linear in
the number of columns (observed up to 40K columns). Please see the plot
below.

[plot: job duration vs. number of columns]

Similar results were observed for the transformations filter, groupBy,
sum, withColumn, and drop. This is a huge performance improvement which is
critical to those working with wide tables, e.g. in machine learning or when
importing data from legacy systems. Many thanks to the authors of this
commit.

2. When adding caching to the test jobs (.cache() right before the .count()),
the duration of the jobs increases and becomes polynomial in the number of
columns. The plot below shows the effect of caching in both Spark 2.3.1 and
2.4.0-SNAPSHOT for a better comparison.

[plot: effect of caching on job duration, Spark 2.3.1 vs. 2.4.0-SNAPSHOT]

Spark 2.4.0-SNAPSHOT completes the jobs faster than 2.3.1. However, the
reason for the polynomial complexity of caching in the number of columns is
not very clear.

3. We have also performed tests with more complex transformations. Compared
to the initial test jobs, the following transformation is added:
 
  df.schema.fields.foldLeft(df)({ // iterate over initial columns
    case (accDf: DataFrame, attr: StructField) =>
      accDf
        .withColumn(s"${attr.name}_isNotNull", df.col(attr.name).isNotNull) // add new column
        .drop(attr.name)                                                    // remove initial column
  }).count()

It iterates over the initial columns. For each column it adds a new boolean
column indicating if the value in the initial column is not null. Then the
initial column is dropped.   
The measured job duration vs. number of columns is shown in the plot below.

[plot: job duration vs. number of columns for the iterative transformation]

The duration of such jobs has significantly increased compared to Spark
2.3.1. Again, it is polynomial in the number of columns.

CONSIDERED OPTIMIZATIONS: 
a) Disabling constraint propagation decreases the duration by 15%, but does
not solve the principal problem.

b) Checkpointing after every 100 columns may decrease the time (by up to 40%
in our experiments). It prunes the lineage and therefore simplifies the work
for the Catalyst optimizer. However, it comes at a high cost: the executors
have to scan over all the rows at each checkpoint. In many situations (e.g.
> 100K rows per executor, or narrow tables with < 100 columns), checkpointing
increases the overall duration. Even in the ideal case of just a few
rows, the speed-up from checkpointing is still not enough to address many tens
of thousands of columns.
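
A rough sketch of that checkpointing variant (same `df` as in the job above;
it assumes a checkpoint directory has already been set via
spark.sparkContext.setCheckpointDir(...)):

  import org.apache.spark.sql.functions.col

  // Same per-column transformation, but truncate the lineage every 100 columns.
  df.schema.fields.zipWithIndex.foldLeft(df) {
    case (accDf, (attr, i)) =>
      val next = accDf
        .withColumn(s"${attr.name}_isNotNull", col(attr.name).isNotNull)
        .drop(attr.name)
      if ((i + 1) % 100 == 0) next.checkpoint() else next // Dataset.checkpoint() is eager by default
  }.count()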

CONCLUSION:
The new improvement in the upcoming Spark 2.4 introduces indexed search over
columns in LogicalPlan.resolve(...). It results in a great performance
improvement in basic transformations. However, there are still some
transformations which are problematic for wide data. In particular, .cache()
shows polynomial complexity in the number of columns. The duration of
jobs featuring iteration over columns has increased compared to the current
Spark 2.3.1. There are potentially parts of the code where the search over
columns remains linear. A discussion on further possible optimizations is very
welcome.







--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[DISCUSS] USING syntax for Datasource V2

2018-08-20 Thread Hyukjin Kwon
Hi all,

I have been trying to follow up on `USING` syntax support, since it looks like
it is currently not supported whereas the `format` API supports it. I have been
trying to understand why and talked with Ryan.

Ryan knows all the details, and he and I thought it would be good to post here - I
have just started to look into this.
Here is Ryan's response:


>USING is currently used to select the underlying data source
implementation directly. The string passed in USING or format in the DF API
is used to resolve an implementation class.
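
For context, a minimal illustration of the two entry points being discussed
(assuming a SparkSession `spark`; the table name and path are hypothetical):

  // SQL: the USING clause names the data source implementation directly.
  spark.sql("CREATE TABLE events (id BIGINT, ts TIMESTAMP) USING parquet")

  // DataFrame API: format() carries the same implementation string.
  val events = spark.read.format("parquet").load("/data/events")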

The existing catalog supports tables that specify their datasource
implementation, but this will not be the case for all catalogs when Spark
adds multiple catalog support. For example, a Cassandra catalog or a JDBC
catalog that exposes tables in those systems will definitely not support
users marking tables with the “parquet” data source. The catalog must have
the ability to determine the data source implementation. That’s why I think
it is valuable to think of the current ExternalCatalog as one that can
track tables with any read/write implementation. Other catalogs can’t and
won’t do that.

> In the catalog v2 API  I’ve
proposed, everything from CREATE TABLE is passed to the catalog. Then the
catalog determines what source to use and returns a Table instance that
uses some class for its ReadSupport and WriteSupport implementation. An
ExternalCatalog exposed through that API would receive the USING or
format string
as a table property and would return a Table that uses the correct
ReadSupport, so tables stored in an ExternalCatalog will work as they do
today.
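
Purely as an illustration of that flow (the names below are hypothetical and
are not the actual API from the PR):

  import org.apache.spark.sql.types.StructType

  // Hypothetical shapes only: CREATE TABLE details go to the catalog, which
  // chooses the source and returns a table that knows its own read/write implementation.
  trait IllustrativeTableCatalog {
    def createTable(
        name: String,
        schema: StructType,
        properties: Map[String, String] // would include the USING/format string
    ): IllustrativeTable
  }

  trait IllustrativeTable {
    def readSupportClassName: String  // e.g. resolved by the catalog from the provider property
    def writeSupportClassName: String
  }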

> I think other catalogs should be able to choose what to do with the USING 
> string.
An Iceberg  catalog might use this to
determine the underlying file format, which could be parquet, orc, or avro.
Or, a JDBC catalog might use it for the underlying table implementation in
the DB. This would make the property more of a storage hint for the
catalog, which is going to determine the read/write implementation anyway.

> For cases where there is no catalog involved, the current plan is to use
the reflection-based approach from v1 with the USING or format string. In
v2, that should resolve a ReadSupportProvider, which is used to create a
ReadSupport directly from options. I think this is a good approach for
backward-compatibility, but it can’t provide the same features as a
catalog-based table. Catalogs are how we have decided to build reliable
behavior for CTAS and the other standard logical plans
.
CTAS is a create and then an insert, and a write implementation alone can’t
provide that create operation.
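
Concretely, the CTAS decomposition described above (table names are
hypothetical, `spark` is a SparkSession):

  // A CTAS statement such as
  //   CREATE TABLE events_copy USING parquet AS SELECT * FROM events
  // is logically a create followed by an insert; a write implementation alone
  // can only do the second step, so the create has to come from a catalog.
  spark.sql("CREATE TABLE events_copy (id BIGINT, ts TIMESTAMP) USING parquet") // create (catalog)
  spark.sql("INSERT INTO events_copy SELECT id, ts FROM events")                // insert (write path)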

I was targeting the last case (where there is no catalog involved) in
particular. I was thinking that approach is also good since `USING` syntax
compatibility should be kept anyway - this should reduce migration cost as
well. I was wondering what you guys think about this.
If you guys think the last case should be supported anyway, I was thinking
we could just proceed with it orthogonally. If you guys think other issues
should be resolved first, I think we (at least I will) should take a look
at the set of catalog APIs.