Re: Concurrency does not improve for Spark Jobs with Same Spark Context

2016-02-18 Thread Prabhu Joseph
The YARN Fair Scheduler is used, and the YARN queue has the entire cluster
resource as its maxResource, so preemption does not come into the picture
during the test case; all the Spark jobs got the resources they requested.

Concurrent jobs with different SparkContexts run fine, so resource
contention does not look like the right explanation.

Performance degrades only for concurrent jobs on a shared SparkContext.
Does SparkContext have any critical section that needs locking and that
jobs end up waiting on? I know Spark and Scala do not follow the old
threading model (they use the actor model, where locking does not happen),
but I still want to verify whether old-style Java threading is used
somewhere.
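
One thing worth ruling out first: within a single application, jobs
submitted to a shared SparkContext from multiple threads are scheduled FIFO
by default, so later jobs can queue behind the stages of earlier ones unless
spark.scheduler.mode is set to FAIR and each thread targets its own pool.
A minimal sketch of that setup (pool names and job bodies are illustrative,
not the original Spark Job Server workload):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch, assuming a fairscheduler.xml that defines pools
// "pool1".."pool3"; the job bodies are placeholders, not the original
// Spark Job Server workload.
object SharedContextFairPools {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shared-context-test")
      .set("spark.scheduler.mode", "FAIR") // default is FIFO within one app
    val sc = new SparkContext(conf)

    val threads = (1 to 3).map { i =>
      new Thread(new Runnable {
        override def run(): Unit = {
          // Scheduler properties are thread-local, so each submitting
          // thread can target its own fair-scheduler pool.
          sc.setLocalProperty("spark.scheduler.pool", s"pool$i")
          val sum = sc.parallelize(1 to 1000000, 12).map(_ * 2L).sum()
          println(s"job $i finished, sum = $sum")
        }
      })
    }

    threads.foreach(_.start())
    threads.foreach(_.join())
    sc.stop()
  }
}
```

(For what it's worth, job submission inside a SparkContext does go through a
single DAGScheduler event loop, but that normally only serializes scheduling
decisions, not the actual task execution.)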



On Friday, February 19, 2016, Jörn Franke  wrote:

> How did you configure YARN queues? What scheduler? Preemption?
>
> > On 19 Feb 2016, at 06:51, Prabhu Joseph wrote:
> >
> > Hi All,
> >
> >    When running concurrent Spark jobs on YARN (Spark 1.5.2) that share a
> > single SparkContext, the jobs take more time to complete compared with
> > when they run with different SparkContexts.
> > The Spark jobs are submitted on different threads.
> >
> > Test Case:
> >
> > A.  3 spark jobs submitted serially
> > B.  3 spark jobs submitted concurrently and with different
> SparkContext
> > C.  3 spark jobs submitted concurrently and with same Spark Context
> > D.  3 spark jobs submitted concurrently and with same Spark Context
> and tripling the resources.
> >
> > A and B take equal time, but C and D take 2-3 times longer than A,
> > which shows that concurrency does not improve with a shared SparkContext.
> > [Spark Job Server]
> >
> > Thanks,
> > Prabhu Joseph
>


Re: Concurrency does not improve for Spark Jobs with Same Spark Context

2016-02-18 Thread Jörn Franke
How did you configure YARN queues? What scheduler? Preemption?

> On 19 Feb 2016, at 06:51, Prabhu Joseph  wrote:
> 
> Hi All,
> 
>    When running concurrent Spark jobs on YARN (Spark 1.5.2) that share a
> single SparkContext, the jobs take more time to complete compared with
> when they run with different SparkContexts.
> The Spark jobs are submitted on different threads.
> 
> Test Case: 
>   
> A.  3 spark jobs submitted serially
> B.  3 spark jobs submitted concurrently and with different SparkContext
> C.  3 spark jobs submitted concurrently and with same Spark Context
> D.  3 spark jobs submitted concurrently and with same Spark Context and 
> tripling the resources.
> 
> A and B take equal time, but C and D take 2-3 times longer than A,
> which shows that concurrency does not improve with a shared SparkContext.
> [Spark Job Server]
> 
> Thanks,
> Prabhu Joseph

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Concurrency does not improve for Spark Jobs with Same Spark Context

2016-02-18 Thread Prabhu Joseph
Hi All,

   When running concurrent Spark jobs on YARN (Spark 1.5.2) that share a
single SparkContext, the jobs take more time to complete compared with
when they run with different SparkContexts.
The Spark jobs are submitted on different threads.

Test Case:

A.  3 spark jobs submitted serially
B.  3 spark jobs submitted concurrently and with different SparkContext
C.  3 spark jobs submitted concurrently and with same Spark Context
D.  3 spark jobs submitted concurrently and with same Spark Context and
tripling the resources.

A and B take equal time, but C and D take 2-3 times longer than A,
which shows that concurrency does not improve with a shared SparkContext.
[Spark Job Server]

Thanks,
Prabhu Joseph


Re: How to run PySpark tests?

2016-02-18 Thread Holden Karau
Great - I'll update the wiki.

On Thu, Feb 18, 2016 at 8:34 PM, Jason White 
wrote:

> Compiling with `build/mvn -Pyarn -Phadoop-2.4 -Phive -Dhadoop.version=2.4.0
> -DskipTests clean package` followed by `python/run-tests` seemed to do the
> trick! Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-run-PySpark-tests-tp16357p16362.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: How to run PySpark tests?

2016-02-18 Thread Jason White
Compiling with `build/mvn -Pyarn -Phadoop-2.4 -Phive -Dhadoop.version=2.4.0
-DskipTests clean package` followed by `python/run-tests` seemed to do the
trick! Thanks!



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-run-PySpark-tests-tp16357p16362.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How to run PySpark tests?

2016-02-18 Thread Holden Karau
I've run into some problems with the Python tests in the past when I
haven't built with Hive support; you might want to build your assembly with
Hive support and see if that helps.

On Thursday, February 18, 2016, Jason White  wrote:

> Hi,
>
> I'm trying to finish up a PR (https://github.com/apache/spark/pull/10089)
> which is currently failing PySpark tests. The instructions to run the test
> suite seem a little dated. I was able to find these:
> https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
> http://spark.apache.org/docs/latest/building-spark.html
>
> I've tried running `python/run-tests`, but it fails hard at the ORC tests.
> I
> suspect it has to do with the external libraries not being compiled or put
> in the right location.
> I've tried running `SPARK_TESTING=1 ./bin/pyspark
> python/pyspark/streaming/tests.py` as suggested, but this doesn't work on
> Spark 2.0.
> I've tried running `SPARK_TESTING=1 ./bin/spark-submit
> python/pyspark/streaming/tests.py` and that worked a little better, but it
> failed at `pyspark.streaming.tests.KafkaStreamTests`, with
> `java.lang.ClassNotFoundException:
> org.apache.spark.streaming.kafka.KafkaTestUtils`. I suspect the same issue
> with external libraries.
>
> I've compiled Spark with `build/mvn -Pyarn -Phadoop-2.4
> -Dhadoop.version=2.4.0 -DskipTests clean package` with no trouble.
>
> Is there any better documentation somewhere about how to run the PySpark
> tests?
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-run-PySpark-tests-tp16357.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> For additional commands, e-mail: dev-h...@spark.apache.org 
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Ability to auto-detect input data for datasources (by file extension).

2016-02-18 Thread Reynold Xin
Thanks for the email.

Don't make it that complicated. We just want to simplify the common cases
(e.g. csv/parquet), and don't need this to work for everything out there.


On Thu, Feb 18, 2016 at 9:25 PM, Hyukjin Kwon  wrote:

> Hi all,
>
> I am planning to submit a PR for
> https://issues.apache.org/jira/browse/SPARK-8000.
>
> Currently, the file format is not detected from the file extension, unlike
> compression codecs, which are.
>
> I am thinking of introducing another interface (a function) in
> DataSourceRegister, just like shortName(), in order to specify the possible
> file extensions so that we can detect datasources by file extension, just
> as Hadoop does for compression codecs.
>
> Since adding an interface should be carefully done, I want to first ask if
> this approach looks appropriate.
>
> Could you please give me some feedback for this?
>
>
> Thanks!
>


Ability to auto-detect input data for datasources (by file extension).

2016-02-18 Thread Hyukjin Kwon
Hi all,

I am planning to submit a PR for
https://issues.apache.org/jira/browse/SPARK-8000.

Currently, the file format is not detected from the file extension, unlike
compression codecs, which are.

I am thinking of introducing another interface (a function) in
DataSourceRegister, just like shortName(), in order to specify the possible
file extensions so that we can detect datasources by file extension, just
as Hadoop does for compression codecs.

Since adding an interface should be carefully done, I want to first ask if
this approach looks appropriate.

Could you please give me some feedback for this?


Thanks!
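
Purely as a sketch of the idea (none of the names below exist in Spark; the
real org.apache.spark.sql.sources.DataSourceRegister defines only
shortName()), this is roughly the shape the SPARK-8000 proposal suggests,
mirroring how Hadoop's CompressionCodecFactory maps file suffixes to codecs:

```scala
// Sketch only: fileExtensions() and the resolver below are NOT Spark APIs;
// they just illustrate the shape of the SPARK-8000 proposal.
trait DataSourceRegisterSketch {
  def shortName(): String
  // Proposed addition: extensions (lower-case, without the dot) this
  // source can be auto-selected for.
  def fileExtensions(): Seq[String]
}

// Example implementation for a CSV-style source.
class CsvRegisterSketch extends DataSourceRegisterSketch {
  override def shortName(): String = "csv"
  override def fileExtensions(): Seq[String] = Seq("csv", "tsv")
}

// A resolver could then pick a datasource from the path suffix, much as
// Hadoop's CompressionCodecFactory maps ".gz" to GzipCodec.
object ExtensionResolverSketch {
  def resolve(path: String,
              registered: Seq[DataSourceRegisterSketch]): Option[DataSourceRegisterSketch] = {
    val ext = path.split('.').drop(1).lastOption.map(_.toLowerCase)
    registered.find(r => ext.exists(e => r.fileExtensions().contains(e)))
  }
}
```

The open question in the thread is just how such a method would be added to
the existing trait, since every registered source would have to implement it
(or a sensible default would be needed).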


Re: Welcoming two new committers

2016-02-18 Thread 刘畅
Awesome! Congrats and welcome!!

2016-02-18 11:26 GMT+08:00 Cheng Lian :

> Awesome! Congrats and welcome!!
>
> Cheng
>
> On Tue, Feb 9, 2016 at 2:55 AM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Congrats!!! Herman and Wenchen!!!
>>
>>
>> On Mon, Feb 8, 2016 at 10:44 AM, Luciano Resende 
>> wrote:
>>
>>>
>>>
>>> On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia 
>>> wrote:
>>>
 Hi all,

 The PMC has recently added two new Spark committers -- Herman van
 Hovell and Wenchen Fan. Both have been heavily involved in Spark SQL and
 Tungsten, adding new features, optimizations and APIs. Please join me in
 welcoming Herman and Wenchen.

 Matei

>>>
>>> Congratulations !!!
>>>
>>> --
>>> Luciano Resende
>>> http://people.apache.org/~lresende
>>> http://twitter.com/lresende1975
>>> http://lresende.blogspot.com/
>>>
>>
>>
>


How to run PySpark tests?

2016-02-18 Thread Jason White
Hi,

I'm trying to finish up a PR (https://github.com/apache/spark/pull/10089)
which is currently failing PySpark tests. The instructions to run the test
suite seem a little dated. I was able to find these:
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
http://spark.apache.org/docs/latest/building-spark.html

I've tried running `python/run-tests`, but it fails hard at the ORC tests. I
suspect it has to do with the external libraries not being compiled or put
in the right location.
I've tried running `SPARK_TESTING=1 ./bin/pyspark
python/pyspark/streaming/tests.py` as suggested, but this doesn't work on
Spark 2.0.
I've tried running `SPARK_TESTING=1 ./bin/spark-submit
python/pyspark/streaming/tests.py`and that worked a little better, but it
failed at `pyspark.streaming.tests.KafkaStreamTests`, with
`java.lang.ClassNotFoundException:
org.apache.spark.streaming.kafka.KafkaTestUtils`. I suspect the same issue
with external libraries.

I've compiling Spark with `build/mvn -Pyarn -Phadoop-2.4
-Dhadoop.version=2.4.0 -DskipTests clean package` with no trouble.

Is there any better documentation somewhere about how to run the PySpark
tests?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-run-PySpark-tests-tp16357.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: DataFrame API and Ordering

2016-02-18 Thread Reynold Xin
You are correct and we should document that.

Any suggestions on where we should document this? In DoubleType and
FloatType?

On Tuesday, February 16, 2016, Maciej Szymkiewicz 
wrote:

> I am not sure if I've missed something obvious, but as far as I can tell
> the DataFrame API doesn't provide clearly defined ordering rules apart
> from NaN handling. Methods like DataFrame.sort or sql.functions like min /
> max provide only a general description. The discrepancy between
> functions.max (min) and GroupedData.max, where the latter supports only
> numeric types, makes the current situation even more confusing. With a
> growing number of orderable types I believe the documentation should
> clearly define the ordering rules, including:
>
> - NULL behavior
> - collation
> - behavior on complex types (structs, arrays)
>
> While this information can be extracted from the source, it is not easily
> accessible, and without an explicit specification it is not clear whether
> the current behavior is contractual. It can also be confusing if a user
> expects an ordering that depends on the current locale (as in R).
>
> Best,
> Maciej
>
>
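
For reference, a minimal sketch (column name and values are illustrative)
that makes the currently observed behavior easy to check: as of Spark
1.5/1.6, an ascending sort puts NULLs first and treats NaN as larger than
any other double, which is exactly the kind of rule the documentation could
state explicitly.

```scala
import org.apache.spark.sql.SQLContext

// Minimal sketch, assuming an existing SQLContext (e.g. sqlContext in
// spark-shell); the column name and values are illustrative.
object OrderingSketch {
  def run(sqlContext: SQLContext): Unit = {
    import sqlContext.implicits._

    val df = Seq(Some(1.0), Some(Double.NaN), None, Some(-1.0))
      .map(Tuple1(_))
      .toDF("x")

    // Observed ascending order: null, -1.0, 1.0, NaN
    // (NULLs first for ASC, NaN larger than any other double value).
    df.sort("x").show()
  }
}
```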


Re: Kafka connector mention in Matei's keynote

2016-02-18 Thread Reynold Xin
I think Matei was referring to the Kafka direct streaming source added in
2015.
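
For context, a minimal sketch (broker and topic names are placeholders) of
the direct, receiver-less Kafka source from the spark-streaming-kafka
subproject that this refers to, as opposed to Kafka Connect:

```scala
import kafka.serializer.StringDecoder

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Minimal sketch of the direct Kafka stream; broker address and topic
// name are placeholders, not from the original discussion.
object DirectKafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-kafka-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Set("events")

    // Each RDD partition maps 1:1 to a Kafka topic partition; offsets are
    // tracked by Spark rather than by a receiver/WAL.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```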


On Thu, Feb 18, 2016 at 11:59 AM, Cody Koeninger  wrote:

> I saw this slide:
> http://image.slidesharecdn.com/east2016v2matei-160217154412/95/2016-spark-summit-east-keynote-matei-zaharia-5-638.jpg?cb=1455724433
>
> Didn't see the talk - was this just referring to the existing work on the
> spark-streaming-kafka subproject, or is someone actually working on making
> Kafka Connect ( http://docs.confluent.io/2.0.0/connect/ ) play nice with
> Spark?
>
>


Kafka connector mention in Matei's keynote

2016-02-18 Thread Cody Koeninger
I saw this slide:
http://image.slidesharecdn.com/east2016v2matei-160217154412/95/2016-spark-summit-east-keynote-matei-zaharia-5-638.jpg?cb=1455724433

Didn't see the talk - was this just referring to the existing work on the
spark-streaming-kafka subproject, or is someone actually working on making
Kafka Connect ( http://docs.confluent.io/2.0.0/connect/ ) play nice with
Spark?


Re: SPARK-9559

2016-02-18 Thread Daniel Darabos
YARN may be a workaround.

On Thu, Feb 18, 2016 at 4:13 PM, Ashish Soni  wrote:

> Hi All,
>
> Just wanted to know if there is any workaround or resolution for the
> issue below in standalone mode:
>
> https://issues.apache.org/jira/browse/SPARK-9559
>
> Ashish
>


SPARK-9559

2016-02-18 Thread Ashish Soni
Hi All,

Just wanted to know if there is any workaround or resolution for the
issue below in standalone mode:

https://issues.apache.org/jira/browse/SPARK-9559

Ashish