Re: Speeding up Catalyst engine

2017-07-25 Thread Maciej Bryński
Hi,

I backported this to 2.2.
First results of the tests (a join of about 60 tables):
Vanilla Spark: 50 sec
With 20392 - 38 sec
With 20392 and spark.sql.selfJoinAutoResolveAmbiguity=false - 29 sec
Vanilla Spark with spark.sql.selfJoinAutoResolveAmbiguity=false - 34 sec

I didn't measure any difference when changing
spark.sql.constraintPropagation.enabled or any other spark.sql
option.
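
For reference, a minimal sketch of how these options can be set when building
the session (the appName is a placeholder; both keys are internal Spark SQL
options, so treat this as an illustration rather than a recommendation):

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("join-benchmark")  // placeholder app name
  .config("spark.sql.selfJoinAutoResolveAmbiguity", "false")
  .config("spark.sql.constraintPropagation.enabled", "false")
  .getOrCreate()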

So I will keep your patch applied on top of 2.2.
Thank you.

M.

2017-07-25 1:39 GMT+02:00 Liang-Chi Hsieh <vii...@gmail.com>:

>
> Hi Maciej,
>
> For backporting https://issues.apache.org/jira/browse/SPARK-20392, you can
> see the suggestion from the committers on the PR. I don't think we expect it
> to be merged into 2.2.
>
>
>
> Maciej Bryński wrote
> > Hi Everyone,
> > I'm trying to speed up my Spark streaming application and I have the
> > following problem.
> > I'm using a lot of joins in my app, and a full Catalyst analysis is
> > triggered during every join.
> >
> > I found two options to speed things up.
> >
> > 1) spark.sql.selfJoinAutoResolveAmbiguity  option
> > But looking at code:
> > https://github.com/apache/spark/blob/8cd9cdf17a7a4ad6f2eecd7c4b388ca363c20982/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L918
> >
> > Shouldn't lines 925-927 be before 920-922 ?
> >
> > 2) https://issues.apache.org/jira/browse/SPARK-20392
> >
> > Is it safe to use it on top of 2.2.0 ?
> >
> > Regards,
> > --
> > Maciek Bryński
>
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Speeding-up-
> Catalyst-engine-tp22013p22014.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 
Maciek Bryński


Speeding up Catalyst engine

2017-07-24 Thread Maciej Bryński
Hi Everyone,
I'm trying to speed up my Spark streaming application and I have the
following problem.
I'm using a lot of joins in my app, and a full Catalyst analysis is triggered
during every join.

I found two options to speed things up.

1) The spark.sql.selfJoinAutoResolveAmbiguity option.
But looking at the code:
https://github.com/apache/spark/blob/8cd9cdf17a7a4ad6f2eecd7c4b388ca363c20982/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L918

Shouldn't lines 925-927 come before 920-922?

2) https://issues.apache.org/jira/browse/SPARK-20392

Is it safe to use it on top of 2.2.0 ?
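
For reference, one way to confirm how much of the per-join time goes into
analysis itself is to time the analyzed plan directly. A minimal Scala sketch
(Spark 2.1+; left, right and "id" are placeholders for the joined DataFrames
and their join key):

val joined = left.join(right, "id")
spark.time { joined.queryExecution.analyzed }  // first access runs the analyzer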

Regards,
-- 
Maciek Bryński


Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-19 Thread Maciej Bryński
Oh yeah, new Spark version, new regression bugs :)

https://issues.apache.org/jira/browse/SPARK-21470

M.

2017-07-17 22:01 GMT+02:00 Sam Elamin :

> Well done! This is amazing news :) Congrats and really can't wait to
> spread the structured streaming love!
>
> On Mon, Jul 17, 2017 at 5:25 PM, kant kodali  wrote:
>
>> +1
>>
>> On Tue, Jul 11, 2017 at 3:56 PM, Jean Georges Perrin  wrote:
>>
>>> Awesome! Congrats! Can't wait!!
>>>
>>> jg
>>>
>>>
>>> On Jul 11, 2017, at 18:48, Michael Armbrust 
>>> wrote:
>>>
>>> Hi all,
>>>
>>> Apache Spark 2.2.0 is the third release of the Spark 2.x line. This
>>> release removes the experimental tag from Structured Streaming. In
>>> addition, this release focuses on usability, stability, and polish,
>>> resolving over 1100 tickets.
>>>
>>> We'd like to thank our contributors and users for their contributions
>>> and early feedback to this release. This release would not have been
>>> possible without you.
>>>
>>> To download Spark 2.2.0, head over to the download page:
>>> http://spark.apache.org/downloads.html
>>>
>>> To view the release notes: https://spark.apache.or
>>> g/releases/spark-release-2-2-0.html
>>>
>>> *(note: If you see any issues with the release notes, webpage or
>>> published artifacts, please contact me directly off-list) *
>>>
>>> Michael
>>>
>>>
>>
>


-- 
Maciek Bryński


Re: Slowness of Spark Thrift Server

2017-07-17 Thread Maciej Bryński
I did the test on Spark 2.2.0 and the problem still exists.

Any ideas on how to fix it?

Regards,
Maciek

2017-07-11 11:52 GMT+02:00 Maciej Bryński <mac...@brynski.pl>:

> Hi,
> I have the following issue.
> I'm trying to use Spark as a proxy to Cassandra.
> The problem is the Thrift Server overhead.
>
> I'm using the following query:
> select * from table where primary_key = 123
>
> Job time (from the Jobs tab) is around 50 ms (and it's similar to the query
> time from the SQL tab).
> Unfortunately, the query time seen from the JDBC/ODBC Server is 650 ms.
> Any ideas why? What could cause such an overhead?
>
> Regards,
> --
> Maciek Bryński
>



-- 
Maciek Bryński


Slowness of Spark Thrift Server

2017-07-11 Thread Maciej Bryński
Hi,
I have the following issue.
I'm trying to use Spark as a proxy to Cassandra.
The problem is the Thrift Server overhead.

I'm using the following query:
select * from table where primary_key = 123

Job time (from the Jobs tab) is around 50 ms (and it's similar to the query
time from the SQL tab).
Unfortunately, the query time seen from the JDBC/ODBC Server is 650 ms.
Any ideas why? What could cause such an overhead?
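
For reference, a rough Scala sketch of timing the same query end-to-end through
the Thrift Server (host, port, credentials and table name are placeholders, and
the Hive JDBC driver has to be on the classpath):

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
val stmt = conn.createStatement()
val start = System.nanoTime()
val rs = stmt.executeQuery("select * from table where primary_key = 123")
while (rs.next()) {}  // drain the result set so fetch time is included
println(s"end-to-end: ${(System.nanoTime() - start) / 1e6} ms")
conn.close()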

Regards,
-- 
Maciek Bryński


Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-14 Thread Maciej Bryński
https://issues.apache.org/jira/browse/SPARK-12717

This bug has been in Spark since 1.6.0.
Any chance of getting it fixed?

M.

2017-04-14 6:39 GMT+02:00 Holden Karau :
> If it would help I'd be more than happy to look at kicking off the packaging
> for RC3 since I've been poking around in Jenkins a bit (for SPARK-20216 &
> friends) (I'd still probably need some guidance from a previous release
> coordinator so I understand if that's not actually faster).
>
> On Mon, Apr 10, 2017 at 6:39 PM, DB Tsai  wrote:
>>
>> I backported the fix into both branch-2.1 and branch-2.0. Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0x5CED8B896A6BDFA0
>>
>>
>> On Mon, Apr 10, 2017 at 4:20 PM, Ryan Blue  wrote:
>> > DB,
>> >
>> > This vote already failed and there isn't a RC3 vote yet. If you backport
>> > the
>> > changes to branch-2.1 they will make it into the next RC.
>> >
>> > rb
>> >
>> > On Mon, Apr 10, 2017 at 3:55 PM, DB Tsai  wrote:
>> >>
>> >> -1
>> >>
>> >> I think that back-porting SPARK-20270 and SPARK-18555 is very
>> >> important,
>> >> since they fix a critical bug where na.fill will mess up Long data
>> >> even when
>> >> the data isn't null.
>> >>
>> >> Thanks.
>> >>
>> >>
>> >> Sincerely,
>> >>
>> >> DB Tsai
>> >> --
>> >> Web: https://www.dbtsai.com
>> >> PGP Key ID: 0x5CED8B896A6BDFA0
>> >>
>> >> On Wed, Apr 5, 2017 at 11:12 AM, Holden Karau 
>> >> wrote:
>> >>>
>> >>> Following up, the issues with missing pypandoc/pandoc on the packaging
>> >>> machine has been resolved.
>> >>>
>> >>> On Tue, Apr 4, 2017 at 3:54 PM, Holden Karau 
>> >>> wrote:
>> 
>>  See SPARK-20216, if Michael can let me know which machine is being
>>  used
>>  for packaging I can see if I can install pandoc on it (should be
>>  simple but
>>  I know the Jenkins cluster is a bit on the older side).
>> 
>>  On Tue, Apr 4, 2017 at 3:06 PM, Holden Karau 
>>  wrote:
>> >
>> > So the fix is installing pandoc on whichever machine is used for
>> > packaging. I thought that was generally done on the machine of the
>> > person
>> > rolling the release so I wasn't sure it made sense as a JIRA, but
>> > from
>> > chatting with Josh it sounds like that part might be one of the
>> > Jenkins
>> > workers - is there a fixed one that is used?
>> >
>> > Regardless I'll file a JIRA for this when I get back in front of my
>> > desktop (~1 hour or so).
>> >
>> > On Tue, Apr 4, 2017 at 2:35 PM Michael Armbrust
>> >  wrote:
>> >>
>> >> Thanks for the comments everyone.  This vote fails.  Here's how I
>> >> think we should proceed:
>> >>  - [SPARK-20197] - SparkR CRAN - appears to be resolved
>> >>  - [SPARK-] - Python packaging - Holden, please file a JIRA and
>> >> report if this is a regression and if there is an easy fix that we
>> >> should
>> >> wait for.
>> >>
>> >> For all the other test failures, please take the time to look
>> >> through
>> >> JIRA and open an issue if one does not already exist so that we can
>> >> triage
>> >> if these are just environmental issues.  If I don't hear any
>> >> objections I'm
>> >> going to go ahead with RC3 tomorrow.
>> >>
>> >> On Sun, Apr 2, 2017 at 1:16 PM, Felix Cheung
>> >>  wrote:
>> >>>
>> >>> -1
>> >>> sorry, found an issue with SparkR CRAN check.
>> >>> Opened SPARK-20197 and working on fix.
>> >>>
>> >>> 
>> >>> From: holden.ka...@gmail.com  on behalf of
>> >>> Holden Karau 
>> >>> Sent: Friday, March 31, 2017 6:25:20 PM
>> >>> To: Xiao Li
>> >>> Cc: Michael Armbrust; dev@spark.apache.org
>> >>> Subject: Re: [VOTE] Apache Spark 2.1.1 (RC2)
>> >>>
>> >>> -1 (non-binding)
>> >>>
>> >>> Python packaging doesn't seem to have quite worked out (looking at
>> >>> PKG-INFO the description is "Description: ! missing pandoc do
>> >>> not upload
>> >>> to PyPI "), ideally it would be nice to have this as a version
>> >>> we
>> >>> upgrade to PyPi.
>> >>> Building this on my own machine results in a longer description.
>> >>>
>> >>> My guess is that whichever machine was used to package this is
>> >>> missing the pandoc executable (or possibly pypandoc library).
>> >>>
>> >>> On Fri, Mar 31, 2017 at 3:40 PM, Xiao Li 
>> >>> wrote:
>> 
>>  +1
>> 
>>  Xiao
>> 
>>  2017-03-30 16:09 GMT-07:00 Michael Armbrust
>>  :

Re: [Pyspark, SQL] Very slow IN operator

2017-04-06 Thread Maciej Bryński
2017-04-06 4:00 GMT+02:00 Michael Segel :
> Just out of curiosity, what would happen if you put your 10K values into a
> temp table and then did a join against it?

The answer is predicate pushdown.
In my case I'm using this kind of query on a JDBC table, and the IN predicate
is executed on the DB in less than 1 s.
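
A rough Scala sketch of how this can be verified (URL, table name and
properties are placeholders for the JDBC source); the physical plan should
list the predicate under PushedFilters on the scan node:

val props = new java.util.Properties()
val jdbcDF = spark.read.jdbc("jdbc:mysql://db-host/dbname", "some_table", props)
jdbcDF.where("id in (1, 2, 3)").explain()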


Regards,
-- 
Maciek Bryński

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Maciej Bryński
Hi,
I'm trying to run queries with many values in the IN operator.

The result is that for more than 10K values the IN operator gets much slower.

For example, this code runs for about 20 seconds.

df = spark.range(0,10,1,1)
df.where('id in ({})'.format(','.join(map(str, range(10))))).count()

Any ideas on how to improve this?
Is it a bug?
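
From the follow-up thread, one workaround is to join against the values
instead of inlining them into the SQL string. A rough Scala sketch of the idea
(df stands for the original DataFrame, and 10000 is a placeholder for the
value set):

val wanted = spark.range(0, 10000).toDF("id")
df.join(wanted, "id").count()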
-- 
Maciek Bryński

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-30 Thread Maciej Bryński
+1

2016-09-30 7:01 GMT+02:00 vaquar khan :

> +1 (non-binding)
> Regards,
> Vaquar  khan
>
> On 29 Sep 2016 23:00, "Denny Lee"  wrote:
>
>> +1 (non-binding)
>>
>> On Thu, Sep 29, 2016 at 9:43 PM Jeff Zhang  wrote:
>>
>>> +1
>>>
>>> On Fri, Sep 30, 2016 at 9:27 AM, Burak Yavuz  wrote:
>>>
 +1

 On Sep 29, 2016 4:33 PM, "Kyle Kelley"  wrote:

> +1
>
> On Thu, Sep 29, 2016 at 4:27 PM, Yin Huai 
> wrote:
>
>> +1
>>
>> On Thu, Sep 29, 2016 at 4:07 PM, Luciano Resende <
>> luckbr1...@gmail.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and
 passes if a majority of at least 3+1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 2.0.1
 [ ] -1 Do not release this package because ...


 The tag to be voted on is v2.0.1-rc4 (933d2c1ea4e5f5c4ec8d375b5ccaa
 4577ba4be38)

 This release candidate resolves 301 issues:
 https://s.apache.org/spark-2.0.1-jira

 The release files, including signatures, digests, etc. can be found
 at:
 http://people.apache.org/~pwendell/spark-releases/spark-2.0.
 1-rc4-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapache
 spark-1203/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-2.0.
 1-rc4-docs/


 Q: How can I help test this release?
 A: If you are a Spark user, you can help us test this release by
 taking an existing Spark workload and running on this release 
 candidate,
 then reporting any regressions from 2.0.0.

 Q: What justifies a -1 vote for this release?
 A: This is a maintenance release in the 2.0.x series.  Bugs already
 present in 2.0.0, missing features, or bugs related to new features 
 will
 not necessarily block this release.

 Q: What fix version should I use for patches merging into
 branch-2.0 from now on?
 A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a
 new RC (i.e. RC5) is cut, I will change the fix version of those 
 patches to
 2.0.1.



>>>
>>>
>>> --
>>> Luciano Resende
>>> http://twitter.com/lresende1975
>>> http://lresende.blogspot.com/
>>>
>>
>>
>
>
> --
> Kyle Kelley (@rgbkrk ; lambdaops.com)
>

>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>


-- 
Maciek Bryński


Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Maciej Bryński
+1
At last :)

2016-09-26 19:56 GMT+02:00 Sameer Agarwal :

> +1 (non-binding)
>
> On Mon, Sep 26, 2016 at 9:54 AM, Davies Liu  wrote:
>
>> +1 (non-binding)
>>
>> On Mon, Sep 26, 2016 at 9:36 AM, Joseph Bradley 
>> wrote:
>> > +1
>> >
>> > On Mon, Sep 26, 2016 at 7:47 AM, Denny Lee 
>> wrote:
>> >>
>> >> +1 (non-binding)
>> >> On Sun, Sep 25, 2016 at 23:20 Jeff Zhang  wrote:
>> >>>
>> >>> +1
>> >>>
>> >>> On Mon, Sep 26, 2016 at 2:03 PM, Shixiong(Ryan) Zhu
>> >>>  wrote:
>> 
>>  +1
>> 
>>  On Sun, Sep 25, 2016 at 10:43 PM, Pete Lee 
>>  wrote:
>> >
>> > +1
>> >
>> >
>> > On Sun, Sep 25, 2016 at 3:26 PM, Herman van Hövell tot Westerflier
>> >  wrote:
>> >>
>> >> +1 (non-binding)
>> >>
>> >> On Sun, Sep 25, 2016 at 2:05 PM, Ricardo Almeida
>> >>  wrote:
>> >>>
>> >>> +1 (non-binding)
>> >>>
>> >>> Built and tested on
>> >>> - Ubuntu 16.04 / OpenJDK 1.8.0_91
>> >>> - CentOS / Oracle Java 1.7.0_55
>> >>> (-Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver
>> >>> -Pyarn)
>> >>>
>> >>>
>> >>> On 25 September 2016 at 22:35, Matei Zaharia
>> >>>  wrote:
>> 
>>  +1
>> 
>>  Matei
>> 
>>  On Sep 25, 2016, at 1:25 PM, Josh Rosen <
>> joshro...@databricks.com>
>>  wrote:
>> 
>>  +1
>> 
>>  On Sun, Sep 25, 2016 at 1:16 PM Yin Huai 
>>  wrote:
>> >
>> > +1
>> >
>> > On Sun, Sep 25, 2016 at 11:40 AM, Dongjoon Hyun
>> >  wrote:
>> >>
>> >> +1 (non binding)
>> >>
>> >> RC3 is compiled and tested on the following two systems, too.
>> All
>> >> tests passed.
>> >>
>> >> * CentOS 7.2 / Oracle JDK 1.8.0_77 / R 3.3.1
>> >>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>> >> -Phive-thriftserver -Dsparkr
>> >> * CentOS 7.2 / Open JDK 1.8.0_102
>> >>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>> >> -Phive-thriftserver
>> >>
>> >> Cheers,
>> >> Dongjoon
>> >>
>> >>
>> >>
>> >> On Saturday, September 24, 2016, Reynold Xin <
>> r...@databricks.com>
>> >> wrote:
>> >>>
>> >>> Please vote on releasing the following candidate as Apache
>> Spark
>> >>> version 2.0.1. The vote is open until Tue, Sep 27, 2016 at
>> 15:30 PDT and
>> >>> passes if a majority of at least 3+1 PMC votes are cast.
>> >>>
>> >>> [ ] +1 Release this package as Apache Spark 2.0.1
>> >>> [ ] -1 Do not release this package because ...
>> >>>
>> >>>
>> >>> The tag to be voted on is v2.0.1-rc3
>> >>> (9d28cc10357a8afcfb2fa2e6eecb5c2cc2730d17)
>> >>>
>> >>> This release candidate resolves 290 issues:
>> >>> https://s.apache.org/spark-2.0.1-jira
>> >>>
>> >>> The release files, including signatures, digests, etc. can be
>> >>> found at:
>> >>>
>> >>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.
>> 1-rc3-bin/
>> >>>
>> >>> Release artifacts are signed with the following key:
>> >>> https://people.apache.org/keys/committer/pwendell.asc
>> >>>
>> >>> The staging repository for this release can be found at:
>> >>>
>> >>> https://repository.apache.org/content/repositories/orgapache
>> spark-1201/
>> >>>
>> >>> The documentation corresponding to this release can be found
>> at:
>> >>>
>> >>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.
>> 1-rc3-docs/
>> >>>
>> >>>
>> >>> Q: How can I help test this release?
>> >>> A: If you are a Spark user, you can help us test this release
>> by
>> >>> taking an existing Spark workload and running on this release
>> candidate,
>> >>> then reporting any regressions from 2.0.0.
>> >>>
>> >>> Q: What justifies a -1 vote for this release?
>> >>> A: This is a maintenance release in the 2.0.x series.  Bugs
>> >>> already present in 2.0.0, missing features, or bugs related
>> to new features
>> >>> will not necessarily block this release.
>> >>>
>> >>> Q: What fix version should I use for patches merging into
>> >>> branch-2.0 from now on?
>> >>> A: Please mark the fix version as 2.0.2, rather than 2.0.1.
>> If a
>> >>> new RC (i.e. RC4) is cut, I will change the fix version of
>> those patches to
>> >>> 2.0.1.
>> >>>
>> >>>
>> 

Cache'ing performance

2016-08-27 Thread Maciej Bryński
Hi,
I did some benchmarking of the cache function today.

*RDD*
sc.parallelize(0 until Int.MaxValue).cache().count()

*Datasets*
spark.range(Int.MaxValue).cache().count()

For me, Datasets were 2 times slower.

Results (3 nodes, 20 cores and 48 GB RAM each):
*RDD - 6 s*
*Datasets - 13.5 s*

Is that the expected behavior for Datasets and Encoders?
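
For reference, a rough sketch of a timing helper for re-running both variants
in the same session (nothing Spark-specific, just wall-clock time):

def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
  result
}

timed("RDD cache + count")     { sc.parallelize(0 until Int.MaxValue).cache().count() }
timed("Dataset cache + count") { spark.range(Int.MaxValue).cache().count() }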

Regards,
-- 
Maciek Bryński


Re: Performance of loading parquet files into case classes in Spark

2016-08-27 Thread Maciej Bryński
2016-08-27 15:27 GMT+02:00 Julien Dumazert :

> df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _)


I think reduce and sum have very different performance.
Did you try sql.functions.sum?
Or, if you want to benchmark access to the Row object, then the count()
function would be a better idea.
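
A Scala sketch of the comparison being suggested (df and "fieldToSum" come
from the quoted benchmark; spark.implicits._ is needed for the encoder in the
map/reduce variant):

import org.apache.spark.sql.functions.sum
import spark.implicits._

df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _)  // deserializes every row
df.agg(sum("fieldToSum")).collect()                         // stays inside the SQL engine
df.count()                                                  // isolates the cost of materializing rows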

Regards,
-- 
Maciek Bryński


Re: Tree for SQL Query

2016-08-25 Thread Maciej Bryński
@rxin
It's not what I'm looking for.
Explain prints output like this:
== Physical Plan ==
*Project [id#1576L AS id#1582L]
+- *Range (0, 1000, splits=400)

I'd like to have the whole tree with expressions.

So when I have "select x + y" there should be an Add expression, etc.
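
For example, queryExecution exposes the analyzed and optimized trees, where
expressions such as Add appear in full (a sketch; x, y and t are placeholders):

val df = spark.sql("select x + y from t")
println(df.queryExecution.analyzed.numberedTreeString)
println(df.queryExecution.optimizedPlan.treeString)
df.queryExecution.analyzed.expressions.foreach(e => println(e.treeString))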

M.

2016-08-24 22:39 GMT+02:00 Reynold Xin <r...@databricks.com>:
> It's basically the output of the explain command.
>
>
> On Wed, Aug 24, 2016 at 12:31 PM, Maciej Bryński <mac...@brynski.pl> wrote:
>>
>> Hi,
>> I read this article:
>>
>> https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
>>
>> And I have a question. Is it possible to get / print Tree for SQL Query ?
>>
>> Something like this:
>>
>> Add(Attribute(x), Add(Literal(1), Literal(2)))
>>
>> Regards,
>> --
>> Maciek Bryński
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>



-- 
Maciek Bryński

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: GraphFrames 0.2.0 released

2016-08-24 Thread Maciej Bryński
Hi,
Do you plan to add a tag for this release on GitHub?
https://github.com/graphframes/graphframes/releases

Regards,
Maciek

2016-08-17 3:18 GMT+02:00 Jacek Laskowski :

> Hi Tim,
>
> AWESOME. Thanks a lot for releasing it. That makes me even more eager
> to see it in Spark's codebase (and replacing the current RDD-based
> API)!
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Aug 16, 2016 at 9:32 AM, Tim Hunter 
> wrote:
> > Hello all,
> > I have released version 0.2.0 of the GraphFrames package. Apart from a
> few
> > bug fixes, it is the first release published for Spark 2.0 and both scala
> > 2.10 and 2.11. Please let us know if you have any comment or questions.
> >
> > It is available as a Spark package:
> > https://spark-packages.org/package/graphframes/graphframes
> >
> > The source code is available as always at
> > https://github.com/graphframes/graphframes
> >
> >
> > What is GraphFrames?
> >
> > GraphFrames is a DataFrame-based graph engine Spark. In addition to the
> > algorithms available in GraphX, users can write highly expressive
> queries by
> > leveraging the DataFrame API, combined with a new API for motif finding.
> The
> > user also benefits from DataFrame performance optimizations within the
> Spark
> > SQL engine.
> >
> > Cheers
> >
> > Tim
> >
> >
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Maciek Bryński


Re: Spark SQL and Kryo registration

2016-08-05 Thread Maciej Bryński
Hi Olivier,
Did you check the performance of Kryo?
I have observed that Kryo is slightly slower than the Java serializer.
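
For reference, a minimal sketch of the setup being discussed (MyRecord is a
hypothetical payload class):

case class MyRecord(id: Long, name: String)

val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[MyRecord]))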

Regards,
Maciek

2016-08-04 17:41 GMT+02:00 Amit Sela :

> It should. Codegen uses the SparkConf in SparkEnv when instantiating a new
> Serializer.
>
> On Thu, Aug 4, 2016 at 6:14 PM Jacek Laskowski  wrote:
>
>> Hi Olivier,
>>
>> I don't know either, but am curious what you've tried already.
>>
>> Jacek
>>
>> On 3 Aug 2016 10:50 a.m., "Olivier Girardot" <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>>> Hi everyone,
>>> I'm currently trying to use Spark 2.0.0 and make DataFrames work with
>>> kryo.registrationRequired=true.
>>> Is it even possible at all considering the codegen ?
>>>
>>> Regards,
>>>
>>> *Olivier Girardot* | Associé
>>> o.girar...@lateral-thoughts.com
>>> +33 6 24 09 17 94
>>>
>>


-- 
Maciek Bryński


Result code of whole stage codegen

2016-08-05 Thread Maciej Bryński
Hi,
I have some operations on a DataFrame / Dataset.
How can I see the source code produced by whole-stage codegen?
Is there any API for this? Or maybe I should configure log4j in a specific
way?
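
One option (a sketch, Spark 2.0; df stands for the DataFrame/Dataset in
question) is the debug package, which can dump the generated Java source per
WholeStageCodegen subtree:

import org.apache.spark.sql.execution.debug._

df.debugCodegen()  // prints the generated source to stdout
// "EXPLAIN CODEGEN <query>" in SQL should give similar output, if your build supports it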

Regards,
-- 
Maciek Bryński


Re: Spark jdbc update SaveMode

2016-07-22 Thread Maciej Bryński
2016-07-22 23:05 GMT+02:00 Ramon Rosa da Silva :
> Hi Folks,
>
>
>
> What do you think about allowing an update SaveMode via
> DataFrame.write.mode(“update”)?
>
> Now Spark just has jdbc insert.

I'm working on a patch that creates a new mode, 'upsert'.
In MySQL it will use the 'REPLACE INTO' command.
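
A rough Scala sketch of the same idea done manually today, before such a mode
exists (URL, credentials, table and column names are placeholders; it assumes
two columns, id and value):

import java.sql.DriverManager

df.foreachPartition { rows =>
  val conn = DriverManager.getConnection("jdbc:mysql://db-host/dbname", "user", "pass")
  val stmt = conn.prepareStatement("REPLACE INTO some_table (id, value) VALUES (?, ?)")
  rows.foreach { row =>
    stmt.setLong(1, row.getAs[Long]("id"))
    stmt.setString(2, row.getAs[String]("value"))
    stmt.addBatch()
  }
  stmt.executeBatch()
  stmt.close()
  conn.close()
}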

M.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Maciej Bryński
@Michael,
I answered in JIRA and will repeat it here.
I think that my problem is unrelated to Hive, because I'm using the
read.parquet method.
I also attached some VisualVM snapshots to SPARK-16321 (I think I should
merge both issues), and code profiling suggests a bottleneck when reading
the parquet file.

I wonder if there are any other benchmarks related to parquet performance.

Regards,
-- 
Maciek Bryński


Re: transtition SQLContext to SparkSession

2016-07-19 Thread Maciej Bryński
@Reynold Xin,
How will this work with Hive support?
Does SparkSession.sqlContext return a HiveContext?
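
A minimal sketch of how this looks with the 2.0 API (enableHiveSupport
replaces creating a HiveContext); spark.sqlContext was exposed via the PR
linked below:

val spark = org.apache.spark.sql.SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

// a plain SQLContext, but it shares the Hive-enabled session state,
// so Hive tables and UDFs stay visible through it
val sqlContext = spark.sqlContext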

2016-07-19 0:26 GMT+02:00 Reynold Xin :
> Good idea.
>
> https://github.com/apache/spark/pull/14252
>
>
>
> On Mon, Jul 18, 2016 at 12:16 PM, Michael Armbrust 
> wrote:
>>
>> + dev, reynold
>>
>> Yeah, that's a good point. I wonder if SparkSession.sqlContext should be
>> public/deprecated?
>>
>> On Mon, Jul 18, 2016 at 8:37 AM, Koert Kuipers  wrote:
>>>
>>> in my codebase i would like to gradually transition to SparkSession, so
>>> while i start using SparkSession i also want a SQLContext to be available as
>>> before (but with a deprecated warning when i use it). this should be easy
>>> since SQLContext is now a wrapper for SparkSession.
>>>
>>> so basically:
>>> val session = SparkSession.builder.set(..., ...).getOrCreate()
>>> val sqlc = new SQLContext(session)
>>>
>>> however this doesnt work, the SQLContext constructor i am trying to use
>>> is private. SparkSession.sqlContext is also private.
>>>
>>> am i missing something?
>>>
>>> a non-gradual switch is not very realistic in any significant codebase,
>>> and i do not want to create SparkSession and SQLContext independendly (both
>>> from same SparkContext) since that can only lead to confusion and
>>> inconsistent settings.
>>
>>
>



-- 
Maciek Bryński

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-06 Thread Maciej Bryński
-1
https://issues.apache.org/jira/browse/SPARK-16379
https://issues.apache.org/jira/browse/SPARK-16371

2016-07-06 7:35 GMT+02:00 Reynold Xin :
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.0. The vote is open until Friday, July 8, 2016 at 23:00 PDT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.0
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.0-rc2
> (4a55b2326c8cf50f772907a8b73fd5e7b3d1aa06).
>
> This release candidate resolves ~2500 issues:
> https://s.apache.org/spark-2.0.0-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1189/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/
>
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.x.
>
> ==
> What justifies a -1 vote for this release?
> ==
> Critical bugs impacting major functionalities.
>
> Bugs already present in 1.x, missing features, or bugs related to new
> features will not necessarily block this release. Note that historically
> Spark documentation has been published on the website separately from the
> main release so we do not need to block the release due to documentation
> errors either.
>



-- 
Maciek Bryński

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark 2.0 Performance drop

2016-06-30 Thread Maciej Bryński
I filed two JIRAs:
1) Performance when querying nested columns
https://issues.apache.org/jira/browse/SPARK-16320

2) Pyspark performance
https://issues.apache.org/jira/browse/SPARK-16321

I found existing JIRAs for:
1) PPD on nested columns
https://issues.apache.org/jira/browse/SPARK-5151

2) Dropped support for df.map etc. in PySpark
https://issues.apache.org/jira/browse/SPARK-13594

2016-06-30 0:47 GMT+02:00 Michael Allman <mich...@videoamp.com>:
> The patch we use in production is for 1.5. We're porting the patch to master 
> (and downstream to 2.0, which is presently very similar) with the intention 
> of submitting a PR "soon". We'll push it here when it's ready: 
> https://github.com/VideoAmp/spark-public.
>
> Regarding benchmarking, we have a suite of Spark SQL regression tests which 
> we run to check correctness and performance. I can share our findings when I 
> have them.
>
> Cheers,
>
> Michael
>
>> On Jun 29, 2016, at 2:39 PM, Maciej Bryński <mac...@brynski.pl> wrote:
>>
>> 2016-06-29 23:22 GMT+02:00 Michael Allman <mich...@videoamp.com>:
>>> I'm sorry I don't have any concrete advice for you, but I hope this helps 
>>> shed some light on the current support in Spark for projection pushdown.
>>>
>>> Michael
>>
>> Michael,
>> Thanks for the answer. This resolves one of my questions.
>> Which Spark version you have patched ? 1.6 ? Are you planning to
>> public this patch or just for 2.0 branch ?
>>
>> I gladly help with some benchmark in my environment.
>>
>> Regards,
>> --
>> Maciek Bryński
>



-- 
Maciek Bryński

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark 2.0 Performance drop

2016-06-29 Thread Maciej Bryński
2016-06-29 23:22 GMT+02:00 Michael Allman :
> I'm sorry I don't have any concrete advice for you, but I hope this helps 
> shed some light on the current support in Spark for projection pushdown.
>
> Michael

Michael,
Thanks for the answer. This resolves one of my questions.
Which Spark version have you patched? 1.6? Are you planning to
publish this patch, or is it just for the 2.0 branch?

I'd gladly help with some benchmarking in my environment.

Regards,
-- 
Maciek Bryński

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Spark 2.0 Performance drop

2016-06-29 Thread Maciej Bryński
Hi,
Did anyone measure the performance of Spark 2.0 vs Spark 1.6?

I did some tests on a parquet file with many nested columns (about 30 GB in
400 partitions), and Spark 2.0 is sometimes 2x slower.

I tested the following queries:
1) select count(*) where id > some_id
In this query we have PPD and performance is similar (about 1 sec).

2) select count(*) where nested_column.id > some_id
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min
Is it normal that neither version did PPD? (See the explain sketch below the query list.)

3) Spark connected with python
df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %
10 else []).collect()
Spark 1.6 -> 2.3 min
Spark 2.0 -> 4.6 min (2x slower)
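
A Scala sketch for checking the pushdown from query 2 (the path and the
constant are placeholders; column names match the benchmark): compare
PushedFilters on the parquet scan for a top-level vs. a nested column.

val df = spark.read.parquet("/path/to/data")
df.where("id > 1000").explain()                // expect the filter under PushedFilters
df.where("nested_column.id > 1000").explain()  // if nothing is pushed here, there is no PPD on nested fields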

I used BasicProfiler for this task and cumulative time was:
Spark 1.6 - 4300 sec
Spark 2.0 - 5800 sec

Should I expect such a drop in performance ?

BTW: why did DataFrame lose the map and flatMap methods in Spark 2.0?

I don't know how to prepare sample data to show the problem.
Any ideas? Or public data with many nested columns?

I'd like to create a JIRA for it, but the Apache server is down at the moment.

Regards,
-- 
Maciek Bryński

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-23 Thread Maciej Bryński
-1

I need SPARK-13283 to be solved.

Regards,
Maciek Bryński

2016-06-23 0:13 GMT+02:00 Krishna Sankar :

> +1 (non-binding, of course)
>
> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 37:11 min
>  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
> 2. Tested pyspark, mllib (iPython 4.0)
> 2.0 Spark version is 1.6.2
> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> 2.2. Linear/Ridge/Lasso Regression OK
> 2.3. Decision Tree, Naive Bayes OK
> 2.4. KMeans OK
>Center And Scale OK
> 2.5. RDD operations OK
>   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>Model evaluation/optimization (rank, numIter, lambda) with
> itertools OK
> 3. Scala - MLlib
> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> 3.2. LinearRegressionWithSGD OK
> 3.3. Decision Tree OK
> 3.4. KMeans OK
> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> 3.6. saveAsParquetFile OK
> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
> registerTempTable, sql OK
> 3.8. result = sqlContext.sql("SELECT
> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
> 4.0. Spark SQL from Python OK
> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
> 5.0. Packages
> 5.1. com.databricks.spark.csv - read/write OK (--packages
> com.databricks:spark-csv_2.10:1.4.0)
> 6.0. DataFrames
> 6.1. cast,dtypes OK
> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
> 6.3. All joins,sql,set operations,udf OK
> 7.0. GraphX/Scala
> 7.1. Create Graph (small and bigger dataset) OK
> 7.2. Structure APIs - OK
> 7.3. Social Network/Community APIs - OK
> 7.4. Algorithms (PageRank of 2 datasets, aggregateMessages() ) OK
>
> Cheers & Good Work, Folks
> 
>
> On Sun, Jun 19, 2016 at 9:24 PM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.2. The vote is open until Wednesday, June 22, 2016 at 22:00 PDT and
>> passes if a majority of at least 3+1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v1.6.2-rc2
>> (54b1121f351f056d6b67d2bb4efe0d553c0f7482)
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.2-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1186/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.2-rc2-docs/
>>
>>
>> ===
>> == How can I help test this release? ==
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions from 1.6.1.
>>
>> 
>> == What justifies a -1 vote for this release? ==
>> 
>> This is a maintenance release in the 1.6.x series.  Bugs already present
>> in 1.6.1, missing features, or bugs related to new features will not
>> necessarily block this release.
>>
>>
>>
>>
>


-- 
Maciek Bryński


Spark 1.6.0 + Hive + HBase

2016-01-28 Thread Maciej Bryński
Hi,
I'm trying to run a SQL query on a Hive table which is stored in HBase.
I'm using:
- Spark 1.6.0
- HDP 2.2
- Hive 0.14.0
- HBase 0.98.4

I managed to configure a working classpath, but I have the following problems:

1) I have a UDF defined in the Hive Metastore (FUNCS table).
Spark cannot use it.

 File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308,
in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o51.sql.
: org.apache.spark.sql.AnalysisException: undefined function
dwh.str_to_map_int_str; line 55 pos 30
at
org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:69)
at
org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:69)
at scala.Option.getOrElse(Option.scala:120)
at
org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:68)
at
org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:64)
at scala.util.Try.getOrElse(Try.scala:77)
at
org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:64)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:574)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:574)
at
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:573)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:570)


2) When I'm using SQL without this function, Spark tries to connect to
ZooKeeper on localhost.
I made a tunnel from localhost to one of the ZooKeeper servers, but that's
not a real solution.

16/01/28 10:09:18 INFO ZooKeeper: Client
environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
16/01/28 10:09:18 INFO ZooKeeper: Client environment:host.name=j4.jupyter1
16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.version=1.8.0_66
16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.vendor=Oracle
Corporation
16/01/28 10:09:18 INFO ZooKeeper: Client
environment:java.home=/usr/lib/jvm/java-8-oracle/jre
16/01/28 10:09:18 INFO ZooKeeper: Client
environment:java.class.path=/opt/spark/lib/mysql-connector-java-5.1.35-bin.jar:/opt/spark/lib/dwh-hbase-connector.jar:/opt/spark/lib/hive-hbase-handler-1.2.1.spark.jar:/opt/spark/lib/hbase-server.jar:/opt/spark/lib/hbase-common.jar:/opt/spark/lib/dwh-commons.jar:/opt/spark/lib/guava.jar:/opt/spark/lib/hbase-client.jar:/opt/spark/lib/hbase-protocol.jar:/opt/spark/lib/htrace-core.jar:/opt/spark/conf/:/opt/spark/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark/lib/datanucleus-core-3.2.10.jar:/etc/hadoop/conf/
16/01/28 10:09:18 INFO ZooKeeper: Client
environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.io.tmpdir=/tmp
16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.compiler=
16/01/28 10:09:18 INFO ZooKeeper: Client environment:os.name=Linux
16/01/28 10:09:18 INFO ZooKeeper: Client environment:os.arch=amd64
16/01/28 10:09:18 INFO ZooKeeper: Client
environment:os.version=3.13.0-24-generic
16/01/28 10:09:18 INFO ZooKeeper: Client environment:user.name=mbrynski
16/01/28 10:09:18 INFO ZooKeeper: Client
environment:user.home=/home/mbrynski
16/01/28 10:09:18 INFO ZooKeeper: Client environment:user.dir=/home/mbrynski
16/01/28 10:09:18 INFO ZooKeeper: Initiating client connection,
connectString=localhost:2181 sessionTimeout=9
watcher=hconnection-0x36079f06, quorum=localhost:2181, baseZNode=/hbase
16/01/28 10:09:18 INFO RecoverableZooKeeper: Process
identifier=hconnection-0x36079f06 connecting to ZooKeeper
ensemble=localhost:2181
16/01/28 10:09:18 INFO ClientCnxn: Opening socket connection to server
localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL
(unknown error)
16/01/28 10:09:18 INFO ClientCnxn: Socket connection established to
localhost/127.0.0.1:2181, initiating session
16/01/28 10:09:18 INFO ClientCnxn: Session establishment complete on server
localhost/127.0.0.1:2181, sessionid = 0x15254709ed3c8e1, negotiated timeout
= 4
16/01/28 10:09:18 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is
null


3) After making the tunnel I'm getting an NPE.

Caused by: java.lang.NullPointerException
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:269)
at

Re: Spark 1.6.0 + Hive + HBase

2016-01-28 Thread Maciej Bryński
Ted,
You're right.
hbase-site.xml resolved problems 2 and 3, but...

Problem 4)
Spark doesn't push down predicates for HiveTableScan, which means that every
query is a full scan.

== Physical Plan ==
TungstenAggregate(key=[],
functions=[(count(1),mode=Final,isDistinct=false)],
output=[count#144L])
+- TungstenExchange SinglePartition, None
   +- TungstenAggregate(key=[],
functions=[(count(1),mode=Partial,isDistinct=false)],
output=[count#147L])
  +- Project
 +- Filter (added_date#141L >= 20160128)
+- HiveTableScan [added_date#141L], MetastoreRelation
dwh_diagnostics, sessions_hbase, None


Is there any magic option to make this work ?

Regards,
Maciek

2016-01-28 10:25 GMT+01:00 Ted Yu <yuzhih...@gmail.com>:

> For the last two problems, hbase-site.xml seems not to be on classpath.
>
> Once hbase-site.xml is put on classpath, you should be able to make
> progress.
>
> Cheers
>
> On Jan 28, 2016, at 1:14 AM, Maciej Bryński <mac...@brynski.pl> wrote:
>
> Hi,
> I'm trying to run SQL query on Hive table which is stored on HBase.
> I'm using:
> - Spark 1.6.0
> - HDP 2.2
> - Hive 0.14.0
> - HBase 0.98.4
>
> I managed to configure working classpath, but I have following problems:
>
> 1) I have UDF defined in Hive Metastore (FUNCS table).
> Spark cannot use it..
>
>  File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308,
> in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o51.sql.
> : org.apache.spark.sql.AnalysisException: undefined function
> dwh.str_to_map_int_str; line 55 pos 30
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:69)
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:69)
> at scala.Option.getOrElse(Option.scala:120)
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:68)
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:64)
> at scala.util.Try.getOrElse(Try.scala:77)
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:64)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:574)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:574)
> at
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:573)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:570)
>
>
> 2) When I'm using SQL without this function Spark tries to connect to
> Zookeeper on localhost.
> I make a tunnel from localhost to one of the zookeeper servers but it's
> not a solution.
>
> 16/01/28 10:09:18 INFO ZooKeeper: Client
> environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:host.name=j4.jupyter1
> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.version=1.8.0_66
> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.vendor=Oracle
> Corporation
> 16/01/28 10:09:18 INFO ZooKeeper: Client
> environment:java.home=/usr/lib/jvm/java-8-oracle/jre
> 16/01/28 10:09:18 INFO ZooKeeper: Client
> environment:java.class.path=/opt/spark/lib/mysql-connector-java-5.1.35-bin.jar:/opt/spark/lib/dwh-hbase-connector.jar:/opt/spark/lib/hive-hbase-handler-1.2.1.spark.jar:/opt/spark/lib/hbase-server.jar:/opt/spark/lib/hbase-common.jar:/opt/spark/lib/dwh-commons.jar:/opt/spark/lib/guava.jar:/opt/spark/lib/hbase-client.jar:/opt/spark/lib/hbase-protocol.jar:/opt/spark/lib/htrace-core.jar:/opt/spark/conf/:/opt/spark/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark/lib/datanucleus-core-3.2.10.jar:/etc/hadoop/conf/
> 16/01/28 10:09:18 INFO ZooKeeper: Client
> environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.io.tmpdir=/tmp
> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.compiler=
> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:os.name=Linux
> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:os.arch=amd64
> 16/01/28 10:09:18 INFO ZooKeep

Re: Spark 1.6.0 and HDP 2.2 - problem

2016-01-13 Thread Maciej Bryński
Thanks.
I successfully compiled Spark 1.6.0 with Jackson 2.2.3 from source.

I'll try using it.

2016-01-13 11:25 GMT+01:00 Ted Yu <yuzhih...@gmail.com>:
> I would suggest trying option #1 first.
>
> Thanks
>
>> On Jan 13, 2016, at 2:12 AM, Maciej Bryński <mac...@brynski.pl> wrote:
>>
>> Hi,
>> I'm trying to run Spark 1.6.0 on HDP 2.2
>> Everything was fine until I tried to turn on dynamic allocation.
>> According to instruction I need to add shuffle service to yarn classpath.
>> The problem is that HDP 2.2 has jackson 2.2.3 and Spark is using 2.4.4.
>> So connecting it gives error:
>>
>> 2016-01-11 16:56:51,222 INFO  containermanager.AuxServices
>> (AuxServices.java:addService(72)) - Adding auxiliary service
>> spark_shuffle, "spark_shuffle"
>> 2016-01-11 16:56:51,439 FATAL nodemanager.NodeManager
>> (NodeManager.java:initAndStartNodeManager(465)) - Error starting
>> NodeManager
>> java.lang.NoSuchMethodError:
>> com.fasterxml.jackson.core.JsonFactory.requiresPropertyOrdering()Z
>>at 
>> com.fasterxml.jackson.databind.ObjectMapper.(ObjectMapper.java:457)
>>at 
>> com.fasterxml.jackson.databind.ObjectMapper.(ObjectMapper.java:379)
>>at 
>> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.(ExternalShuffleBlockResolver.java:57)
>>at 
>> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.(ExternalShuffleBlockHandler.java:56)
>>at 
>> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:128)
>>at 
>> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>>at 
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
>>at 
>> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>>at 
>> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>>at 
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:237)
>>at 
>> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>>at 
>> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>>at 
>> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:253)
>>at 
>> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>>at 
>> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:462)
>>at 
>> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:509)
>>
>>
>> What can I do ?
>> I have following ideas:
>> 1) Compile Spark 1.6.0 with modified pom.xml (change jackson version
>> to 2.2.3). I'm not sure if this will work
>> 2) I tried to put shuffle service from different version of Spark.
>> 1.4.1 works on HDP 2.2.
>> Is it possible to run shuffle service from 1.4.1 with Spark 1.6.0 ?
>> 3) Other ideas ?
>>
>> Regards,
>> --
>> Maciek Bryński
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>



-- 
Maciek Bryński

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark 1.6.0 and HDP 2.2 - problem

2016-01-13 Thread Maciej Bryński
Steve,
Thank you for the answer.
How does Hortonworks deal with this problem internally?
You have Spark 1.3.1 in HDP 2.3. Is it compiled with Jackson 2.2.3?

Regards,
Maciek

2016-01-13 18:00 GMT+01:00 Steve Loughran <ste...@hortonworks.com>:
>
>> On 13 Jan 2016, at 03:23, Maciej Bryński <mac...@brynski.pl> wrote:
>>
>> Thanks.
>> I successfully compiled Spark 1.6.0 with Jackson 2.2.3 from source.
>>
>> I'll try using it.
>>
>
> This is the eternal classpath version problem, with Jackson turning out to be 
> incredibly brittle. After one point update of the 1.x JAR broke things (it 
> removed a method), there's ~0 enthusiasm for incrementing the version 
> counters again.
>
>
> 1. Hadoop, even Hadoop trunk, is built with : 
> 2.2.3
>
> this means that right now, the Spark 1.6 shuffle isn't going to work in *any* 
> Hadoop cluster that hasn't been built with a compatible Jackson version: 
> either rebuild spark 1.6 or hadoop-core itself.
>
>
> 2. There's a YARN JIRA for classpath isolation in aux services: 
> https://issues.apache.org/jira/browse/YARN-4577 , with a longer term one for 
> forking off the services entirely: 
> https://issues.apache.org/jira/browse/YARN-1593 . there's a patch for the 
> first one, which, if someone wants to apply and test locally, would be
> valued. It wouldn't ship in Hadoop until 2.9. . Patch #2 got assigned to 
> someone last week, so maybe it'll surface in Hadoop 2.9 instead/as well.
>
> 3. I'm going to open a SPARK JIRA here, cross link it to the YARN ones -so at 
> least there'll be a central record.  (Done: SPARK-12807)
>
> 4. I'll also add an "upgrade jackson" issue under HADOOP-9991, though like I 
> said: enthusiasm will be low.
>
> 5. You can D/L a version of spark 1.6 built against HDP 2.3:
> http://hortonworks.com/hadoop-tutorial/apache-spark-1-6-technical-preview-with-hdp-2-3/
>
> This isn't likely to work against HDP 2.2 BTW; later Hadoop JAR versions.
>
> -I suspect for things to work on CDH there'll be something similar. For ASF 
> Hadoop, rebuilding spark is all you have.
>
> 6. Looking @spark/master, it's been on jackson 2.5.3 since last month, from 
> SPARK-12269. Which is just going to make versioning even more traumatic. And 
> we know that amazon-aws has a back compatibility track record, so swapping 
> things around there is going to be fun. You'd probably need to rebuild 
> Hadoop-2.7.2+ with the later aws/s3 JARs to keep everything aligned
>
> on #4: has anyone found any compatibility problems if they swap out Jackson 
> 2.2.3 for Jackson 2.4.4 or 2.5.3 *without recompiling anything*? That's what 
> we need to know for Hadoop JAR updates.
>
> -Steve
>
>
>
>
>
>> 2016-01-13 11:25 GMT+01:00 Ted Yu <yuzhih...@gmail.com>:
>>> I would suggest trying option #1 first.
>>>
>>> Thanks
>>>
>>>> On Jan 13, 2016, at 2:12 AM, Maciej Bryński <mac...@brynski.pl> wrote:
>>>>
>>>> Hi,
>>>> I'm trying to run Spark 1.6.0 on HDP 2.2
>>>> Everything was fine until I tried to turn on dynamic allocation.
>>>> According to instruction I need to add shuffle service to yarn classpath.
>>>> The problem is that HDP 2.2 has jackson 2.2.3 and Spark is using 2.4.4.
>>>> So connecting it gives error:
>>>>
>>>> 2016-01-11 16:56:51,222 INFO  containermanager.AuxServices
>>>> (AuxServices.java:addService(72)) - Adding auxiliary service
>>>> spark_shuffle, "spark_shuffle"
>>>> 2016-01-11 16:56:51,439 FATAL nodemanager.NodeManager
>>>> (NodeManager.java:initAndStartNodeManager(465)) - Error starting
>>>> NodeManager
>>>> java.lang.NoSuchMethodError:
>>>> com.fasterxml.jackson.core.JsonFactory.requiresPropertyOrdering()Z
>>>>   at 
>>>> com.fasterxml.jackson.databind.ObjectMapper.(ObjectMapper.java:457)
>>>>   at 
>>>> com.fasterxml.jackson.databind.ObjectMapper.(ObjectMapper.java:379)
>>>>   at 
>>>> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.(ExternalShuffleBlockResolver.java:57)
>>>>   at 
>>>> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.(ExternalShuffleBlockHandler.java:56)
>>>>   at 
>>>> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:128)
>>>>   at 
>>>> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>>>>   at 
>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit