Re: Concurrency does not improve for Spark Jobs with Same Spark Context
Fair Scheduler. The YARN queue has the entire cluster resource as maxResource, preemption does not come into the picture during the test case, and all the Spark jobs got the requested resources. The concurrent jobs with different Spark Contexts run fine, so suspecting resource contention is not correct. The performance degrades only for concurrent jobs on a shared Spark Context.

Does SparkContext have any critical section that needs locking, with jobs waiting to read it? I know Spark and Scala do not use the old thread model -- they use the Actor model, where locking does not happen -- but I still want to verify whether old-style Java threading is used somewhere.

On Friday, February 19, 2016, Jörn Franke wrote:
> How did you configure YARN queues? What scheduler? Preemption?
>
> > On 19 Feb 2016, at 06:51, Prabhu Joseph wrote:
> >
> > Hi All,
> >
> > When running concurrent Spark jobs on YARN (Spark-1.5.2) which share a single Spark Context, the jobs take more time to complete compared with when they run with different Spark Contexts. The Spark jobs are submitted on different threads.
> >
> > Test Case:
> >
> > A. 3 Spark jobs submitted serially
> > B. 3 Spark jobs submitted concurrently, each with a different SparkContext
> > C. 3 Spark jobs submitted concurrently with the same Spark Context
> > D. 3 Spark jobs submitted concurrently with the same Spark Context and triple the resources
> >
> > A and B take equal time, but C and D take 2-3 times longer than A, which shows concurrency does not improve with a shared Spark Context. [Spark Job Server]
> >
> > Thanks,
> > Prabhu Joseph
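For what it's worth, within a single SparkContext it is the in-process fair scheduler (`spark.scheduler.mode=FAIR`) rather than the YARN queue that controls how concurrent jobs share executors. A minimal pool file sketch -- the pool name here is made up for illustration:

```xml
<?xml version="1.0"?>
<!-- fairscheduler.xml, referenced via spark.scheduler.allocation.file.
     A submitting thread opts into a pool with
     sc.setLocalProperty("spark.scheduler.pool", "poolA") -->
<allocations>
  <pool name="poolA">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
```

Without this, concurrent jobs on a shared context fall back to FIFO scheduling inside the application, which could explain part of the serialization observed in cases C and D.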
Re: Concurrency does not improve for Spark Jobs with Same Spark Context
How did you configure YARN queues? What scheduler? Preemption?

> On 19 Feb 2016, at 06:51, Prabhu Joseph wrote:
>
> Hi All,
>
> When running concurrent Spark jobs on YARN (Spark-1.5.2) which share a single Spark Context, the jobs take more time to complete compared with when they run with different Spark Contexts. The Spark jobs are submitted on different threads.
>
> Test Case:
>
> A. 3 Spark jobs submitted serially
> B. 3 Spark jobs submitted concurrently, each with a different SparkContext
> C. 3 Spark jobs submitted concurrently with the same Spark Context
> D. 3 Spark jobs submitted concurrently with the same Spark Context and triple the resources
>
> A and B take equal time, but C and D take 2-3 times longer than A, which shows concurrency does not improve with a shared Spark Context. [Spark Job Server]
>
> Thanks,
> Prabhu Joseph

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Concurrency does not improve for Spark Jobs with Same Spark Context
Hi All,

When running concurrent Spark jobs on YARN (Spark-1.5.2) which share a single Spark Context, the jobs take more time to complete compared with when they run with different Spark Contexts. The Spark jobs are submitted on different threads.

Test Case:

A. 3 Spark jobs submitted serially
B. 3 Spark jobs submitted concurrently, each with a different SparkContext
C. 3 Spark jobs submitted concurrently with the same Spark Context
D. 3 Spark jobs submitted concurrently with the same Spark Context and triple the resources

A and B take equal time, but C and D take 2-3 times longer than A, which shows concurrency does not improve with a shared Spark Context. [Spark Job Server]

Thanks,
Prabhu Joseph
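The multi-threaded submission pattern described above can be sketched as follows. This is a plain-Python illustration, not the actual Spark Job Server code: the placeholder workload stands in for a Spark action on the shared SparkContext, and the commented `setLocalProperty` call shows where a thread would be pinned to a fair-scheduler pool.

```python
import threading

def run_job(job_id, results):
    # In the real test case each thread would run a Spark action on the
    # shared SparkContext, optionally pinned to a fair-scheduler pool first:
    #   sc.setLocalProperty("spark.scheduler.pool", "pool%d" % job_id)
    #   results[job_id] = sc.parallelize(range(1000)).sum()
    # Here a placeholder workload stands in for the Spark action.
    results[job_id] = sum(range(1000))

results = {}
threads = [threading.Thread(target=run_job, args=(i, results))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # -> [0, 1, 2]: all three "jobs" completed
```

In Spark, actions submitted from separate threads of one SparkContext are scheduled as independent jobs, so this pattern is safe; whether they actually run in parallel depends on the in-process scheduler mode and available executor slots.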
Re: How to run PySpark tests?
Great - I'll update the wiki.

On Thu, Feb 18, 2016 at 8:34 PM, Jason White wrote:
> Compiling with `build/mvn -Pyarn -Phadoop-2.4 -Phive -Dhadoop.version=2.4.0 -DskipTests clean package` followed by `python/run-tests` seemed to do the trick! Thanks!
>
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-run-PySpark-tests-tp16357p16362.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Re: How to run PySpark tests?
Compiling with `build/mvn -Pyarn -Phadoop-2.4 -Phive -Dhadoop.version=2.4.0 -DskipTests clean package` followed by `python/run-tests` seemed to do the trick! Thanks!

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-run-PySpark-tests-tp16357p16362.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: How to run PySpark tests?
I've run into some problems with the Python tests in the past when I haven't built with Hive support; you might want to build your assembly with Hive support and see if that helps.

On Thursday, February 18, 2016, Jason White wrote:
> Hi,
>
> I'm trying to finish up a PR (https://github.com/apache/spark/pull/10089) which is currently failing PySpark tests. The instructions to run the test suite seem a little dated. I was able to find these:
> https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
> http://spark.apache.org/docs/latest/building-spark.html
>
> I've tried running `python/run-tests`, but it fails hard at the ORC tests. I suspect it has to do with the external libraries not being compiled or put in the right location.
> I've tried running `SPARK_TESTING=1 ./bin/pyspark python/pyspark/streaming/tests.py` as suggested, but this doesn't work on Spark 2.0.
> I've tried running `SPARK_TESTING=1 ./bin/spark-submit python/pyspark/streaming/tests.py` and that worked a little better, but it failed at `pyspark.streaming.tests.KafkaStreamTests` with `java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaTestUtils`. I suspect the same issue with the external libraries.
>
> I've been compiling Spark with `build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package` with no trouble.
>
> Is there any better documentation somewhere about how to run the PySpark tests?

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
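For reference, the recipe that ended up working later in this thread (the build commands are the ones quoted above, with the `-Phive` profile added per this reply) can be collected into one sketch:

```shell
# Build Spark with YARN, Hadoop 2.4 and Hive support -- the Hive profile
# matters for some of the Python SQL tests -- then run the Python suite.
build/mvn -Pyarn -Phadoop-2.4 -Phive -Dhadoop.version=2.4.0 \
  -DskipTests clean package
python/run-tests
```

The `ClassNotFoundException` for `KafkaTestUtils` suggests the external Kafka module also needs to be built for the streaming tests to pass, though that is an inference from the error message, not something confirmed in the thread.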
Re: Ability to auto-detect input data for datasources (by file extension).
Thanks for the email. Don't make it that complicated. We just want to simplify the common cases (e.g. csv/parquet) and don't need this to work for everything out there.

On Thu, Feb 18, 2016 at 9:25 PM, Hyukjin Kwon wrote:
> Hi all,
>
> I am planning to submit a PR for https://issues.apache.org/jira/browse/SPARK-8000.
>
> Currently, the file format is not detected from the file extension, unlike compression codecs, which are detected.
>
> I am thinking of introducing another interface (a function) in DataSourceRegister, just like shortName(), in order to specify possible file extensions so that we can detect datasources by file extension, just like Hadoop does for compression codecs.
>
> Since adding an interface should be done carefully, I want to first ask if this approach looks appropriate.
>
> Could you please give me some feedback on this?
>
> Thanks!
Ability to auto-detect input data for datasources (by file extension).
Hi all,

I am planning to submit a PR for https://issues.apache.org/jira/browse/SPARK-8000.

Currently, the file format is not detected from the file extension, unlike compression codecs, which are detected.

I am thinking of introducing another interface (a function) in DataSourceRegister, just like shortName(), in order to specify possible file extensions so that we can detect datasources by file extension, just like Hadoop does for compression codecs.

Since adding an interface should be done carefully, I want to first ask if this approach looks appropriate.

Could you please give me some feedback on this?

Thanks!
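The proposed lookup can be sketched language-agnostically as follows -- a hypothetical registry mapping extensions to format names, analogous to how Hadoop's codec factory maps extensions to compression codecs. The extension table and default are illustrative, not the actual proposal:

```python
# Hypothetical extension -> datasource-shortName registry. In the actual
# proposal each datasource would advertise its extensions through a new
# DataSourceRegister method alongside shortName().
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".orc": "orc",
}

def detect_format(path, default="parquet"):
    """Pick a datasource by file extension, falling back to a default."""
    for ext, fmt in EXTENSION_TO_FORMAT.items():
        if path.endswith(ext):
            return fmt
    return default

print(detect_format("hdfs:///data/part-00000.csv"))  # -> csv
print(detect_format("hdfs:///data/part-00000"))      # -> parquet (fallback)
```

One design question such a registry raises (and a reason to keep it to the common cases, as suggested above) is what to do when two datasources claim the same extension.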
Re: Welcoming two new committers
Awesome! Congrats and welcome!!

2016-02-18 11:26 GMT+08:00 Cheng Lian :
> Awesome! Congrats and welcome!!
>
> Cheng
>
> On Tue, Feb 9, 2016 at 2:55 AM, Shixiong(Ryan) Zhu wrote:
>> Congrats!!! Herman and Wenchen!!!
>>
>> On Mon, Feb 8, 2016 at 10:44 AM, Luciano Resende wrote:
>>> On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia wrote:
>>>> Hi all,
>>>>
>>>> The PMC has recently added two new Spark committers -- Herman van Hovell and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten, adding new features, optimizations and APIs. Please join me in welcoming Herman and Wenchen.
>>>>
>>>> Matei
>>>
>>> Congratulations !!!
>>>
>>> --
>>> Luciano Resende
>>> http://people.apache.org/~lresende
>>> http://twitter.com/lresende1975
>>> http://lresende.blogspot.com/
How to run PySpark tests?
Hi,

I'm trying to finish up a PR (https://github.com/apache/spark/pull/10089) which is currently failing PySpark tests. The instructions to run the test suite seem a little dated. I was able to find these:
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
http://spark.apache.org/docs/latest/building-spark.html

I've tried running `python/run-tests`, but it fails hard at the ORC tests. I suspect it has to do with the external libraries not being compiled or put in the right location.
I've tried running `SPARK_TESTING=1 ./bin/pyspark python/pyspark/streaming/tests.py` as suggested, but this doesn't work on Spark 2.0.
I've tried running `SPARK_TESTING=1 ./bin/spark-submit python/pyspark/streaming/tests.py` and that worked a little better, but it failed at `pyspark.streaming.tests.KafkaStreamTests` with `java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaTestUtils`. I suspect the same issue with the external libraries.

I've been compiling Spark with `build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package` with no trouble.

Is there any better documentation somewhere about how to run the PySpark tests?

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-run-PySpark-tests-tp16357.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: DataFrame API and Ordering
You are correct and we should document that. Any suggestions on where we should document this? In DoubleType and FloatType?

On Tuesday, February 16, 2016, Maciej Szymkiewicz wrote:
> I am not sure if I've missed something obvious, but as far as I can tell the DataFrame API doesn't provide clearly defined ordering rules, excluding NaN handling. Methods like DataFrame.sort or sql.functions like min/max provide only a general description. The discrepancy between functions.max (min) and GroupedData.max, where the latter supports only numeric types, makes the current situation even more confusing. With a growing number of orderable types, I believe the documentation should clearly define ordering rules, including:
>
> - NULL behavior
> - collation
> - behavior on complex types (structs, arrays)
>
> While this information can be extracted from the source, it is not easily accessible, and without an explicit specification it is not clear whether the current behavior is contractual. It can also be confusing if a user expects an ordering that depends on the current locale (R).
>
> Best,
> Maciej
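As one concrete example of the kind of rule worth documenting: Spark SQL, to my understanding, orders NaN as larger than any non-NaN double, so NaN sorts last in ascending order. That rule can be sketched in plain Python (this is an illustration of the ordering contract, not Spark's implementation):

```python
import math

def nan_last_key(x):
    # Mirror the documented-in-source rule: NaN compares greater than
    # every other double, including positive infinity.
    return (math.isnan(x), x)

vals = [float("nan"), 1.0, float("inf"), -2.0]
result = sorted(vals, key=nan_last_key)
print(result)  # -> [-2.0, 1.0, inf, nan]
```

Python's built-in `sorted` applied to the same list without the key would give an order that depends on the initial arrangement of the NaN values, which is exactly the kind of silent ambiguity an explicit specification would prevent.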
Re: Kafka connector mention in Matei's keynote
I think Matei was referring to the Kafka direct streaming source added in 2015.

On Thu, Feb 18, 2016 at 11:59 AM, Cody Koeninger wrote:
> I saw this slide: http://image.slidesharecdn.com/east2016v2matei-160217154412/95/2016-spark-summit-east-keynote-matei-zaharia-5-638.jpg?cb=1455724433
>
> Didn't see the talk - was this just referring to the existing work on the spark-streaming-kafka subproject, or is someone actually working on making Kafka Connect ( http://docs.confluent.io/2.0.0/connect/ ) play nice with Spark?
Kafka connector mention in Matei's keynote
I saw this slide: http://image.slidesharecdn.com/east2016v2matei-160217154412/95/2016-spark-summit-east-keynote-matei-zaharia-5-638.jpg?cb=1455724433 Didn't see the talk - was this just referring to the existing work on the spark-streaming-kafka subproject, or is someone actually working on making Kafka Connect ( http://docs.confluent.io/2.0.0/connect/ ) play nice with Spark?
Re: SPARK-9559
YARN may be a workaround.

On Thu, Feb 18, 2016 at 4:13 PM, Ashish Soni wrote:
> Hi All,
>
> Just wanted to know if there is any workaround or resolution for the below issue in standalone mode:
>
> https://issues.apache.org/jira/browse/SPARK-9559
>
> Ashish
SPARK-9559
Hi All,

Just wanted to know if there is any workaround or resolution for the below issue in standalone mode:

https://issues.apache.org/jira/browse/SPARK-9559

Ashish