Re: Speeding up Catalyst engine

2017-07-25 Thread Maciej Bryński
You can see the suggestion from committers on the PR. I think we don't expect it will be merged into 2.2. > Maciej Bryński wrote: > Hi Everyone, > I'm trying to speed up my Spark streaming application and I have the following problem.

Speeding up Catalyst engine

2017-07-24 Thread Maciej Bryński
Hi Everyone, I'm trying to speed up my Spark streaming application and I have the following problem. I'm using a lot of joins in my app, and a full Catalyst analysis is triggered during every join. I found 2 options to speed things up. 1) the spark.sql.selfJoinAutoResolveAmbiguity option. But looking at the code: http

Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-19 Thread Maciej Bryński
Oh yeah, new Spark version, new regression bugs :) https://issues.apache.org/jira/browse/SPARK-21470 M. 2017-07-17 22:01 GMT+02:00 Sam Elamin : > Well done! This is amazing news :) Congrats and really can't wait to spread the structured streaming love! > On Mon, Jul 17, 2017 at 5:25 PM, kan

Re: Slowness of Spark Thrift Server

2017-07-17 Thread Maciej Bryński
I did the test on Spark 2.2.0 and the problem still exists. Any ideas how to fix it? Regards, Maciek 2017-07-11 11:52 GMT+02:00 Maciej Bryński : > Hi, > I have the following issue. > I'm trying to use Spark as a proxy to Cassandra. > The problem is the thrift server overhead.

Slowness of Spark Thrift Server

2017-07-11 Thread Maciej Bryński
Hi, I have the following issue. I'm trying to use Spark as a proxy to Cassandra. The problem is the thrift server overhead. I'm using the following query: select * from table where primary_key = 123 Job time (from the Jobs tab) is around 50ms (and it's similar to the query time from the SQL tab). Unfortunately query

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-14 Thread Maciej Bryński
https://issues.apache.org/jira/browse/SPARK-12717 This bug has been in Spark since 1.6.0. Any chance to get it fixed? M. 2017-04-14 6:39 GMT+02:00 Holden Karau : > If it would help I'd be more than happy to look at kicking off the packaging > for RC3 since I've been poking around in Jenkins a bit (f

Re: [Pyspark, SQL] Very slow IN operator

2017-04-06 Thread Maciej Bryński
2017-04-06 4:00 GMT+02:00 Michael Segel : > Just out of curiosity, what would happen if you put your 10K values into a > temp table and then did a join against it? The answer is predicate pushdown. In my case I'm using this kind of query on a JDBC table, and the IN predicate is executed on the DB in less

[Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Maciej Bryński
Hi, I'm trying to run queries with many values in the IN operator. The result is that for more than 10K values the IN operator gets slow. For example this code runs for about 20 seconds. df = spark.range(0,10,1,1) df.where('id in ({})'.format(','.join(map(str, range(10))))).count() Any
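The filter in the post is built as a plain string predicate passed to `DataFrame.where()`. A minimal sketch of that pattern (the helper function is hypothetical, and the PySpark calls are shown only in comments, so nothing below needs a running Spark session):

```python
# Hypothetical helper that builds the 'IN (...)' predicate string
# used in the post; plain Python, no Spark required.
def in_predicate(column, values):
    """Build an 'IN (...)' predicate string for DataFrame.where()."""
    return "{} in ({})".format(column, ",".join(map(str, values)))

# Usage against a Spark session (the pattern from the post, sketch):
#   df = spark.range(0, 10000, 1, 1)
#   df.where(in_predicate("id", range(10000))).count()
```

The follow-up in the thread suggests the usual workaround: put the values in a temp table and join against it instead of inlining thousands of literals into the predicate.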

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Maciej Bryński
+1 2016-09-30 7:01 GMT+02:00 vaquar khan : > +1 (non-binding) > Regards, > Vaquar khan > > On 29 Sep 2016 23:00, "Denny Lee" wrote: > >> +1 (non-binding) >> >> On Thu, Sep 29, 2016 at 9:43 PM Jeff Zhang wrote: >> >>> +1 >>> >>> On Fri, Sep 30, 2016 at 9:27 AM, Burak Yavuz wrote: >>> +1 >

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Maciej Bryński
+1 At last :) 2016-09-26 19:56 GMT+02:00 Sameer Agarwal : > +1 (non-binding) > > On Mon, Sep 26, 2016 at 9:54 AM, Davies Liu wrote: > >> +1 (non-binding) >> >> On Mon, Sep 26, 2016 at 9:36 AM, Joseph Bradley >> wrote: >> > +1 >> > >> > On Mon, Sep 26, 2016 at 7:47 AM, Denny Lee >> wrote: >> >>

Re: Performance of loading parquet files into case classes in Spark

2016-08-28 Thread Maciej Bryński
> …it seems that it got much slower from 1.6 to 2.0. I guess it's because of the fact that Dataframe is now Dataset[Row], and thus uses the same encoding/decoding mechanism as for any other case class. > Best regards, > Julien > On 27 Aug 2016 at 22:32, M

Cache'ing performance

2016-08-27 Thread Maciej Bryński
Hi, I did some benchmarking of the cache function today. *RDD* sc.parallelize(0 until Int.MaxValue).cache().count() *Datasets* spark.range(Int.MaxValue).cache().count() For me Datasets were 2 times slower. Results (3 nodes, 20 cores and 48GB RAM each): *RDD - 6s* *Datasets - 13.5s* Is that expected be

Re: Performance of loading parquet files into case classes in Spark

2016-08-27 Thread Maciej Bryński
2016-08-27 15:27 GMT+02:00 Julien Dumazert : > df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _) I think reduce and sum have very different performance. Did you try sql.functions.sum? Or if you want to benchmark access to the Row object, then the count() function will be a better idea. Regards,

Re: Tree for SQL Query

2016-08-24 Thread Maciej Bryński
2016-08-24 22:39 GMT+02:00 Reynold Xin : > It's basically the output of the explain command. > On Wed, Aug 24, 2016 at 12:31 PM, Maciej Bryński wrote: >> Hi, >> I read this article: >> https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sql

Tree for SQL Query

2016-08-24 Thread Maciej Bryński
Hi, I read this article: https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html And I have a question. Is it possible to get / print the Tree for a SQL query? Something like this: Add(Attribute(x), Add(Literal(1), Literal(2))) Regards, -- Maciek Bryński
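As the reply in the thread notes, the Catalyst trees are what the explain command prints. A hedged sketch in PySpark (this wrapper is hypothetical; it needs a real DataFrame to produce output):

```python
# Sketch: explain(True) ('extended' mode) prints the parsed, analyzed
# and optimized logical plans plus the physical plan -- the Catalyst
# trees described in the blog post.
def print_catalyst_trees(df):
    # Prints all plan trees to stdout; returns None.
    df.explain(True)
```

The textual plans show nodes like `Project`, `Filter`, and expression trees similar to the `Add(Literal(1), Literal(2))` example above.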

Re: GraphFrames 0.2.0 released

2016-08-24 Thread Maciej Bryński
Hi, Do you plan to add a tag for this release on GitHub? https://github.com/graphframes/graphframes/releases Regards, Maciek 2016-08-17 3:18 GMT+02:00 Jacek Laskowski : > Hi Tim, > AWESOME. Thanks a lot for releasing it. That makes me even more eager > to see it in Spark's codebase (and replaci

Re: Result code of whole stage codegen

2016-08-05 Thread Maciej Bryński
final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { /* 006 */ private Object[] references; /* 007 */ private org.apache.spark.sql.execution.metric.SQLMetric range_numOutputRows; /* 008 */ private boolean range_initRang

Re: Spark SQL and Kryo registration

2016-08-05 Thread Maciej Bryński
Hi Olivier, Did you check the performance of Kryo? I have observed that Kryo is slightly slower than the Java serializer. Regards, Maciek 2016-08-04 17:41 GMT+02:00 Amit Sela : > It should. Codegen uses the SparkConf in SparkEnv when instantiating a new > Serializer. > On Thu, Aug 4, 2016 at 6:14
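The Kryo registration being discussed boils down to a few standard Spark configuration keys. A sketch as plain key/value pairs (the class name under `spark.kryo.classesToRegister` is a placeholder, not from the thread):

```python
# Standard Spark configuration keys for Kryo serialization; the
# registered class name is a placeholder for illustration only.
kryo_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # Fail fast when a class is serialized without being registered
    "spark.kryo.registrationRequired": "true",
    # Comma-separated list of classes to register (placeholder name)
    "spark.kryo.classesToRegister": "com.example.MyCaseClass",
}
```

Usage (sketch): pass each pair to `SparkConf.set(key, value)` before building the SparkContext, so that, as noted in the reply, codegen picks the serializer up from the SparkConf in SparkEnv.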

Result code of whole stage codegen

2016-08-05 Thread Maciej Bryński
Hi, I have some operations on a DataFrame / Dataset. How can I see the source code generated by whole-stage codegen? Is there any API for this? Or maybe I should configure log4j in a specific way? Regards, -- Maciek Bryński
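A hedged sketch of one known answer: in Scala, `df.queryExecution.debug.codegen()` (from `org.apache.spark.sql.execution.debug`) prints the generated Java source. From PySpark the same call can usually be reached through the private py4j handle on the DataFrame; this wrapper assumes that `_jdf` handle is available:

```python
# Sketch, assuming the private _jdf py4j handle (PySpark internal):
# reaches into the JVM DataFrame and prints the whole-stage codegen
# Java source for each codegen stage.
def print_codegen_source(df):
    df._jdf.queryExecution().debug().codegen()
```

This relies on a private attribute and may differ across Spark versions; treat it as a debugging aid rather than a stable API.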

Re: Spark jdbc update SaveMode

2016-07-22 Thread Maciej Bryński
2016-07-22 23:05 GMT+02:00 Ramon Rosa da Silva : > Hi Folks, > What do you think about allowing an update SaveMode from > DataFrame.write.mode("update")? > Now Spark just has jdbc insert. I'm working on a patch that creates a new mode - 'upsert'. In MySQL it will use the 'REPLACE INTO' command. M.
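For reference, MySQL's `REPLACE INTO` deletes any existing row with the same primary or unique key and inserts the new one. A minimal sketch of building such a statement (the helper and names are hypothetical, not from the patch):

```python
# Hypothetical helper: builds a parameterized MySQL REPLACE INTO
# statement of the kind an 'upsert' save mode could emit per row.
def replace_into_sql(table, columns):
    cols = ", ".join(columns)
    params = ", ".join(["?"] * len(columns))
    return "REPLACE INTO {} ({}) VALUES ({})".format(table, cols, params)
```

Note that `REPLACE INTO` is MySQL-specific; other databases would need a different statement (e.g. `MERGE` or `INSERT ... ON CONFLICT`).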

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Maciej Bryński
@Michael, I answered in Jira and can repeat here. I think that my problem is unrelated to Hive, because I'm using the read.parquet method. I also attached some VisualVM snapshots to SPARK-16321 (I think I should merge both issues). And code profiling suggests a bottleneck when reading the parquet file. I wo

Re: transtition SQLContext to SparkSession

2016-07-19 Thread Maciej Bryński
@Reynold Xin, How will this work with Hive support? Will SparkSession.sqlContext return a HiveContext? 2016-07-19 0:26 GMT+02:00 Reynold Xin : > Good idea. > https://github.com/apache/spark/pull/14252 > On Mon, Jul 18, 2016 at 12:16 PM, Michael Armbrust > wrote: >> + dev, reynold >> Yeah
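For context, a sketch of the SparkSession construction with Hive support in the PySpark 2.x API; whether the resulting `spark.sqlContext` is then Hive-backed is exactly the question asked above. The builder call is wrapped in a function here because it needs a Spark installation to actually run:

```python
# Sketch: build a Hive-enabled SparkSession (PySpark 2.x API).
def hive_session(app_name="example"):
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .appName(app_name)
            .enableHiveSupport()  # back the session with the Hive metastore
            .getOrCreate())
```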

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-19 Thread Maciej Bryński
@Sean Owen, As we're not planning to implement Datasets in Python, do you plan to revert this Jira? https://issues.apache.org/jira/browse/SPARK-13594 2016-07-19 10:07 GMT+02:00 Sean Owen : > I think unfortunately at least this one is gonna block: > https://issues.apache.org/jira/browse/SPARK-1662

Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-06 Thread Maciej Bryński
-1 https://issues.apache.org/jira/browse/SPARK-16379 https://issues.apache.org/jira/browse/SPARK-16371 2016-07-06 7:35 GMT+02:00 Reynold Xin : > Please vote on releasing the following candidate as Apache Spark version > 2.0.0. The vote is open until Friday, July 8, 2016 at 23:00 PDT and passes > i

Re: Spark 2.0 Performance drop

2016-06-30 Thread Maciej Bryński
> …have them. > Cheers, > Michael >> On Jun 29, 2016, at 2:39 PM, Maciej Bryński wrote: >> 2016-06-29 23:22 GMT+02:00 Michael Allman : >>> I'm sorry I don't have any concrete advice for you, but I hope this helps >>> shed some light

Re: Spark 2.0 Performance drop

2016-06-29 Thread Maciej Bryński
2016-06-29 23:22 GMT+02:00 Michael Allman : > I'm sorry I don't have any concrete advice for you, but I hope this helps > shed some light on the current support in Spark for projection pushdown. > > Michael Michael, Thanks for the answer. This resolves one of my questions. Which Spark version you

Spark 2.0 Performance drop

2016-06-29 Thread Maciej Bryński
Hi, Did anyone measure the performance of Spark 2.0 vs Spark 1.6? I did some tests on a parquet file with many nested columns (about 30G in 400 partitions) and Spark 2.0 is sometimes 2x slower. I tested the following queries: 1) select count(*) where id > some_id In this query we have PPD and performance i

Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-23 Thread Maciej Bryński
-1 I need SPARK-13283 to be solved. Regards, Maciek Bryński 2016-06-23 0:13 GMT+02:00 Krishna Sankar : > +1 (non-binding, of course) > > 1. Compiled OSX 10.10 (Yosemite) OK Total time: 37:11 min > mvn clean package -Pyarn -Phadoop-2.6 -DskipTests > 2. Tested pyspark, mllib (iPython 4.0) >

Re: Spark 1.6.0 + Hive + HBase

2016-01-28 Thread Maciej Bryński
…progress. > Cheers > On Jan 28, 2016, at 1:14 AM, Maciej Bryński wrote: > Hi, > I'm trying to run a SQL query on a Hive table which is stored in HBase. > I'm using: > - Spark 1.6.0 > - HDP 2.2 > - Hive 0.14.0 > - HBase 0.98.4 > I managed to

Spark 1.6.0 + Hive + HBase

2016-01-28 Thread Maciej Bryński
Hi, I'm trying to run a SQL query on a Hive table which is stored in HBase. I'm using: - Spark 1.6.0 - HDP 2.2 - Hive 0.14.0 - HBase 0.98.4 I managed to configure a working classpath, but I have the following problems: 1) I have a UDF defined in the Hive Metastore (FUNCS table). Spark cannot use it. File "/op

Re: Spark 1.6.0 and HDP 2.2 - problem

2016-01-13 Thread Maciej Bryński
Steve, Thank you for the answer. How does Hortonworks deal with this problem internally? You have Spark 1.3.1 in HDP 2.3. Is it compiled with Jackson 2.2.3? Regards, Maciek 2016-01-13 18:00 GMT+01:00 Steve Loughran : >> On 13 Jan 2016, at 03:23, Maciej Bryński wrote: >>

Re: Spark 1.6.0 and HDP 2.2 - problem

2016-01-13 Thread Maciej Bryński
Thanks. I successfully compiled Spark 1.6.0 with Jackson 2.2.3 from source. I'll try using it. 2016-01-13 11:25 GMT+01:00 Ted Yu : > I would suggest trying option #1 first. > Thanks >> On Jan 13, 2016, at 2:12 AM, Maciej Bryński wrote: >> Hi, >

Spark 1.6.0 and HDP 2.2 - problem

2016-01-13 Thread Maciej Bryński
Hi, I'm trying to run Spark 1.6.0 on HDP 2.2. Everything was fine until I tried to turn on dynamic allocation. According to the instructions I need to add the shuffle service to the YARN classpath. The problem is that HDP 2.2 has jackson 2.2.3 and Spark is using 2.4.4, so connecting them gives an error: 2016-01-11 1
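For context, a sketch of the yarn-site.xml entries the dynamic-allocation instructions refer to (these are the standard property names for the Spark external shuffle service; exact values and jar paths vary per installation):

```xml
<!-- Sketch: register Spark's external shuffle service as a YARN
     NodeManager aux-service. The jackson conflict above arises because
     the spark-<version>-yarn-shuffle.jar added to the NodeManager
     classpath bundles a newer jackson than HDP 2.2 ships. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```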