Re: Performance regression for partitioned parquet data

2017-06-13 Thread Michael Allman
Hi Bertrand, I encourage you to create a ticket for this and submit a PR if you have time. Please add me as a listener, and I'll try to contribute/review. Michael > On Jun 6, 2017, at 5:18 AM, Bertrand Bossy > wrote: > > Hi, > > since moving to spark 2.1 from 2.0, we experience a performanc

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-06-02 Thread Michael Allman
> On Fri, Jun 2, 2017 at 1:32 AM Reynold Xin <mailto:r...@databricks.com>> wrote: > Yea I don't see why this needs to be per table config. If the user wants to > configure it per table, can't they just declare the data type on a per table > basis, once we have

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-06-01 Thread Michael Allman
I would suggest that making timestamp type behavior configurable and persisted per-table could introduce some real confusion, e.g. in queries involving tables with different timestamp type semantics. I suggest starting with the assumption that timestamp type behavior is a per-session flag that

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-25 Thread Michael Allman
PR is here: https://github.com/apache/spark/pull/18112 <https://github.com/apache/spark/pull/18112> > On May 25, 2017, at 10:28 AM, Michael Allman wrote: > > Michael, > > If you haven't started cutting the new RC, I'm working on a documentation PR > ri

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-25 Thread Michael Allman
Michael, If you haven't started cutting the new RC, I'm working on a documentation PR right now that I'm hoping we can get into Spark 2.2 as a migration note, even if it's just a mention: https://issues.apache.org/jira/browse/SPARK-20888 . Michael

Re: Parquet vectorized reader DELTA_BYTE_ARRAY

2017-05-22 Thread Michael Allman
Hi AndreiL, Were these files written with the Parquet V2 writer? The Spark 2.1 vectorized reader does not appear to support that format. Michael > On May 9, 2017, at 11:04 AM, andreiL wrote: > > Hi, I am getting an exception in Spark 2.1 reading parquet files where some > columns are DELTA_B

Re: Method for gracefully terminating a driver on a standalone master in Spark 2.1+

2017-05-22 Thread Michael Allman
As I cannot find a way to gracefully kill an app which takes longer than 10 seconds to shut down, I have reported this issue as a bug: https://issues.apache.org/jira/browse/SPARK-20843 <https://issues.apache.org/jira/browse/SPARK-20843> Michael > On May 4, 2017, at 4:15 PM, Micha

Method for gracefully terminating a driver on a standalone master in Spark 2.1+

2017-05-04 Thread Michael Allman
Hello, In performing our prod cluster upgrade, we've noticed that the behavior for killing a driver has become more aggressive. Whereas pre-2.1 the driver runner would only call `Process.destroy`, in 2.1+ it now calls `Process.destroyForcibly` (on Java 8) if the previous `destroy` call does not return
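[Editor's note: for readers unfamiliar with the distinction, the two-phase kill pattern being described — graceful signal first, forced kill after a grace period — can be sketched with Python's subprocess module. This is a generic illustration, not Spark's actual DriverRunner code; the command and 10-second timeout are illustrative, mirroring the grace period reported in this thread.]

```python
import subprocess

# Launch a stand-in for the driver process (command is illustrative).
proc = subprocess.Popen(["sleep", "60"])

proc.terminate()           # polite shutdown, analogous to Process.destroy
try:
    proc.wait(timeout=10)  # give the process a grace period to exit
except subprocess.TimeoutExpired:
    proc.kill()            # forced, analogous to Process.destroyForcibly
```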

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Michael Allman
Thanks for pointing this out, Michael. Based on the conversation on the PR > <https://github.com/apache/spark/pull/16944#issuecomment-285529275> this > seems like a risky change to include in a release branch with a default other > than NEVER_INFER. > > +Wenchen? What do you th

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-20 Thread Michael Allman
scan all table files during query analysis. Changing this setting to NEVER_INFER disabled this operation and resolved the issue we had. Michael > On Apr 20, 2017, at 3:42 PM, Michael Allman wrote: > > I want to caution that in testing a build from this morning's branch-2.1 we
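[Editor's note: a minimal sketch of disabling the inference step described above, assuming the setting in question is spark.sql.hive.caseSensitiveInferenceMode — the schema inference mode that SPARK-20888 (cited earlier in this digest) sought to document:]

```python
from pyspark.sql import SparkSession

# NEVER_INFER skips the table-file scan performed during query analysis;
# the inferring modes trigger it for Hive metastore Parquet tables.
spark = (SparkSession.builder
         .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
         .enableHiveSupport()
         .getOrCreate())
```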

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-20 Thread Michael Allman
I want to caution that in testing a build from this morning's branch-2.1 we found that Hive partition pruning was not working. We found that Spark SQL was fetching all Hive table partitions for a very simple query whereas in a build from several weeks ago it was fetching only the required partit

Re: Implementation of RNN/LSTM in Spark

2017-02-28 Thread Michael Allman
Hi Yuhao, BigDL looks very promising and it's a framework we're considering using. It seems the general approach to high-performance DL is via GPUs. Your project mentions performance on a Xeon comparable to that of a GPU, but where does this claim come from? Can you provide benchmarks? Thanks,

Simple bug fix PR looking for love

2017-02-09 Thread Michael Allman
Hi Guys, Can someone help move https://github.com/apache/spark/pull/16499 along in the review process? This PR fixes replicated off-heap storage. Thanks! Michael

Re: Unique Partition Id per partition

2017-01-31 Thread Michael Allman
Hi Sumit, Can you use http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=rdd#pyspark.RDD.mapPartitionsWithIndex to solve your problem? Michael > On Jan 31, 2017,
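[Editor's note: a minimal PySpark sketch of the suggestion above — the RDD contents are illustrative. mapPartitionsWithIndex passes each partition's index to the function, which serves as a unique, stable per-partition id:]

```python
from pyspark import SparkContext

sc = SparkContext(appName="partition-id-example")
rdd = sc.parallelize(range(10), 3)

# The first argument is the partition's index, unique across the RDD --
# effectively a partition id we can attach to every record.
def tag_with_partition_id(index, iterator):
    for record in iterator:
        yield (index, record)

tagged = rdd.mapPartitionsWithIndex(tag_with_partition_id)
print(tagged.collect())  # e.g. [(0, 0), (0, 1), (0, 2), (1, 3), ...]
```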

Re: Error Saving Dataframe to Hive with Spark 2.0.0

2017-01-31 Thread Michael Allman
That's understandable. Maybe I can help. :) What happens if you set `HIVE_TABLE_NAME = "default.employees"`? Also, does that table exist before you call `filtered_output_timestamp.write.mode("append").saveAsTable(HIVE_TABLE_NAME)`? Cheers, Michael > On Jan 29, 2017, at 9:52 PM, Chetan Khatri
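[Editor's note: for anyone following the thread, a minimal sketch of the pattern under discussion, using a database-qualified table name; the DataFrame contents are illustrative:]

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("save-as-table-example")
         .enableHiveSupport()
         .getOrCreate())

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Qualifying the table name with its database removes any ambiguity
# about which database the append targets.
HIVE_TABLE_NAME = "default.employees"
df.write.mode("append").saveAsTable(HIVE_TABLE_NAME)
```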

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-01-23 Thread Michael Allman
Hi Stan, What OS/version are you using? Michael > On Jan 22, 2017, at 11:36 PM, StanZhai wrote: > > I'm using Parallel GC. > rxin wrote >> Are you using G1 GC? G1 sometimes uses a lot more memory than the size >> allocated. >> >> >> On Sun, Jan 22, 2017 at 12:58 AM StanZhai
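[Editor's note: for context on the quoted exchange, switching executors between collectors is done through executor JVM options; a hedged sketch, with illustrative values:]

```python
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.executor.memory", "16g")
        # G1 sometimes uses a lot more memory than the size allocated
        # (per the quoted reply); Parallel GC is the JDK 8 default.
        .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC"))
```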

Re: GraphX-related "open" issues

2017-01-20 Thread Michael Allman
> On Fri, Jan 20, 2017 at 1:27 PM, Michael Allman <mailto:mich...@videoamp.com>> wrote: > That sounds fine to me. I think that in closing the issues, we should mention > that we're closing them because these algorithms can be implemented using the > existing API. >

Re: GraphX-related "open" issues

2017-01-19 Thread Michael Allman
nearest neighbor satisfying predicate > <https://issues.apache.org/jira/browse/SPARK-7257> > - SPARK-8497: Graph Clique(Complete Connected Sub-graph) Discovery Algorithm > <https://issues.apache.org/jira/browse/SPARK-8497> > > Best, > Dongjin > > On Fri, Jan 20, 20

Re: GraphX-related "open" issues

2017-01-19 Thread Michael Allman
Regarding new GraphX algorithms, I agree with the idea of publishing, as outside packages, algorithms that can be implemented using the existing API. Regarding SPARK-10335, we have a PR for SPARK-5484 which should address the problem described in that ticket. I've reviewed that PR, but bec

Re: [Spark SQL] Making InferSchema and JacksonParser public

2017-01-18 Thread Michael Allman
Personally I'd love to see some kind of pluggability or configurability in the JSON schema parsing, maybe as an option in the DataFrameReader. Perhaps you can propose an API? > On Jan 18, 2017, at 5:51 AM, Brian Hong wrote: > > I work for a mobile game company. I'm solving a simple question: "Ca

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-18 Thread Michael Allman
> > > On Wed, Jan 18, 2017 at 2:09 PM, Michael Allman <mailto:mich...@videoamp.com>> wrote: > I think I understand. Partition pruning for the case where > spark.sql.hive.convertMetastoreParquet is true was not added to Spark until >

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-17 Thread Michael Allman
However, loading all of that partition metadata can be quite slow for very large tables. I'm sorry I can't think of a better solution for you. Michael > On Jan 17, 2017, at 8:59 PM, Raju Bairishetti wrote: > > Tested on both 1.5.2 and 1.6.1. > > On Wed, Jan 18, 20

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-17 Thread Michael Allman
InputPaths: > maprfs:/user/rajub/dummy/sample/year=2016/month=10, > maprfs:/user/rajub/dummy/sample/year=2016/month=11, > maprfs:/user/rajub/dummy/sample/year=2016/month=9, > maprfs:/user/rajub/dummy/sample/year=2017/month=10, > maprfs:/user/rajub/dummy/sample/year=2017/month=

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-17 Thread Michael Allman
Can you paste the actual query plan here, please? > On Jan 17, 2017, at 7:38 PM, Raju Bairishetti wrote: > > > On Wed, Jan 18, 2017 at 11:13 AM, Michael Allman <mailto:mich...@videoamp.com>> wrote: > What is the physical query plan after you set > spark.sql.hive

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-17 Thread Michael Allman
What is the physical query plan after you set spark.sql.hive.convertMetastoreParquet to true? Michael > On Jan 17, 2017, at 6:51 PM, Raju Bairishetti wrote: > > Thanks Michael for the response. > > > On Wed, Jan 18, 2017 at 2:45 AM, Michael Allman <mailto:mich...
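[Editor's note: a minimal sketch of how one might answer that question, assuming Spark 2.x; the table name is illustrative, chosen to match the paths quoted elsewhere in this thread:]

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# With conversion enabled, the physical plan should show Spark's native
# Parquet scan and, ideally, only the partitions matching the filter.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
spark.sql("SELECT * FROM sample WHERE year = 2016 AND month = 10").explain()
```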

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-17 Thread Michael Allman
Hi Raju, I'm sorry this isn't working for you. I helped author this functionality and will do my best to assist. First, I'm curious why you set spark.sql.hive.convertMetastoreParquet to false? Can you link specifically to the Jira issue or Spark PR you referred to? The first thing I would try i

Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-08 Thread Michael Allman
I believe https://github.com/apache/spark/pull/16122 needs to be included in Spark 2.1. It's a simple bug fix to some functionality that was introduced in 2.1. Unfortunately, it's only been verified manually. There's no unit test that covers it, and b

Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-29 Thread Michael Allman
> On Nov 29, 2016, at 5:15 PM, Michael Allman wrote: > > Hello, > > When I try to read from a Hive table created by Spark 2.1 in Spark 2.0 or > earlier, I get an error: > > java.lang.ClassNotFoundException: Failed to load class for data source: hive. > > Is there a

Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-29 Thread Michael Allman
Hello, When I try to read from a Hive table created by Spark 2.1 in Spark 2.0 or earlier, I get an error: java.lang.ClassNotFoundException: Failed to load class for data source: hive. Is there a way to get previous versions of Spark to read tables written with Spark 2.1? Cheers, Michael

Jackson Spark/app incompatibility and how to resolve it

2016-11-17 Thread Michael Allman
Hello, I'm running into an issue with a Spark app I'm building, which depends on a library that requires Jackson 2.8 and fails at runtime because Spark brings in Jackson 2.6. I'm looking for a solution. As a workaround, I've patched our build of Spark to use Jackson 2.8. That's working, h

Re: Updating Parquet dep to 1.9

2016-11-02 Thread Michael Allman
upgrade to fix these several issues. The > question is just how well it will play with other components, so bears some > testing and evaluation of the changes from 1.8, but yes this would be good. > > On Mon, Oct 31, 2016 at 9:0

Updating Parquet dep to 1.9

2016-10-31 Thread Michael Allman
Hi All, Is anyone working on updating Spark's Parquet library dep to 1.9? If not, I can at least get started on it and publish a PR. Cheers, Michael

Re: help from other committers on getting started

2016-09-02 Thread Michael Allman
Hi Dayne, Have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark . I think you'll find answers to most of your questions there. Cheers, Michael > On Sep 2, 2016, at 8:53 AM, Dayne Sorvis

Re: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-24 Thread Michael Allman
FYI, I've updated the issue's description to include a very simple program which reproduces the issue for me. Cheers, Michael > On Aug 23, 2016, at 4:54 PM, Michael Allman wrote: > > I've replied on the issue's page, but in a word, "yes". See > h

Re: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-23 Thread Michael Allman
exist on today's master/branch-2.0? > > SPARK-16550 was merged. It might be fixed already. > > On Tue, Aug 23, 2016 at 9:37 AM, Michael Allman <mailto:mich...@videoamp.com>> wrote: > FYI, I posted this to user@ and have followed up with a bug report: > https://issu

Fwd: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-23 Thread Michael Allman
FYI, I posted this to user@ and have followed up with a bug report: https://issues.apache.org/jira/browse/SPARK-17204 <https://issues.apache.org/jira/browse/SPARK-17204> Michael > Begin forwarded message: > > From: Michael Allman > Subject: Anyone else having trouble with r

Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Michael Allman
actually quite >>> comprehensive. It shows that PMML can represent a pretty large subset of >>> typical ML pipeline functionality. >>> >>> On the Python side sadly there is even less - I would say your options are >>> pretty much "roll your ow

Re: Serving Spark ML models via a regular Python web app

2016-08-10 Thread Michael Allman
Nick, Check out MLeap: https://github.com/TrueCar/mleap . It's not Python, but we use it in production to serve a random forest model trained by a Spark ML pipeline. Thanks, Michael > On Aug 10, 2016, at 7:50 PM, Nicholas Chammas > wrote: > > Are there any

Re: Scaling partitioned Hive table support

2016-08-09 Thread Michael Allman
pruning. > > > On Mon, Aug 8, 2016, 11:53 AM Michael Allman <mailto:mich...@videoamp.com>> wrote: > Hello, > > I'd like to propose a modification in the way Hive table partition metadata > are loaded and cached. Currently, when a user reads from a partitioned H

Re: Scaling partitioned Hive table support

2016-08-08 Thread Michael Allman
verted) hive table. > > You might also want to look at https://github.com/apache/spark/pull/14241 > <https://github.com/apache/spark/pull/14241> , which refactors some of the > file scan execution to defer pruning. > > > On Mon, Aug 8, 2016, 11:53 AM Michael Allman <m

Scaling partitioned Hive table support

2016-08-08 Thread Michael Allman
Hello, I'd like to propose a modification in the way Hive table partition metadata are loaded and cached. Currently, when a user reads from a partitioned Hive table whose metadata are not cached (and for which Hive table conversion is enabled and supported), all partition metadata are fetched fr

Re: Build speed

2016-07-22 Thread Michael Allman
I use sbt. Rebuilds are super fast. Michael > On Jul 22, 2016, at 7:54 AM, Mikael Ståldal wrote: > > Is there any way to speed up an incremental build of Spark? > > For me it takes 8 minutes to build the project with just a few code changes. > > -- > > > Mikael Ståldal > Senior software d

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Michael Allman
I've run some tests with some real and some synthetic parquet data with nested columns, with and without the hive metastore, on our Spark 1.5, 1.6 and 2.0 versions. I haven't seen any performance surprises, except that Spark 2.0 now does schema inference across all files in a partitione

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Michael Allman
30 June 2016 to this list. He said that his benchmarking > suggested that Spark 2.0 was slower than 1.6. > > I'm wondering if that was ever investigated, and if so if the speed is back > up, or not. > > On Wed, Jul 20, 2016 at 12:18 PM, Michael Allman <mailto:mich...@vi

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Michael Allman
Marcin, I'm not sure what you're referring to. Can you be more specific? Cheers, Michael > On Jul 20, 2016, at 9:10 AM, Marcin Tustin wrote: > > Whatever happened with the query regarding benchmarks? Is that resolved? > > On Tue, Jul 19, 2016 at 10:35 PM, Reynold Xin

Re: transition SQLContext to SparkSession

2016-07-19 Thread Michael Allman
order for SparkSession.sqlContext to return an actual HiveContext, we'd > need to use reflection to create a HiveContext, which is pretty hacky. > > > > On Tue, Jul 19, 2016 at 10:58 AM, Michael Allman <mailto:mich...@videoamp.com>> wrote: > Sorry Reynold, I want t

Re: transition SQLContext to SparkSession

2016-07-19 Thread Michael Allman
Sorry Reynold, I want to triple check this with you. I'm looking at the `SparkSession.sqlContext` field in the latest 2.0 branch, and it appears that that val is set specifically to an instance of the `SQLContext` class. A cast to `HiveContext` will fail. Maybe there's a misunderstanding here. T

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Michael Allman
ot;24g") .set("spark.kryoserializer.buffer.max","1g") .set("spark.sql.codegen.wholeStage", "true") .set("spark.memory.offHeap.enabled", "true") .set("spark.memory.offHeap.size", "25769803776") // 24 GB

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Michael Allman
Spark standalone mode, we run > and prep Spark tests only, driver 8g, executor memory 16g, Kryo, 0.66 memory > fraction, 100 trials > > We can post the 1.6.2 comparison early next week, running lots of iterations > over the weekend once we get the dedicated time again

Spark performance regression test suite

2016-07-08 Thread Michael Allman
Hello, I've seen a few messages on the mailing list regarding Spark performance concerns, especially regressions from previous versions. It got me thinking that perhaps an automated performance regression suite would be a worthwhile contribution. Is anyone working on this? Do we have a Jira iss

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Michael Allman
Hi Adam, Do you have your spark confs and your spark-env.sh somewhere where we can see them? If not, can you make them available? Cheers, Michael > On Jul 8, 2016, at 3:17 AM, Adam Roberts wrote: > > Hi, we've been testing the performance of Spark 2.0 compared to previous > releases, unfort

Re: Anyone knows the hive repo for spark-2.0?

2016-07-07 Thread Michael Allman
FYI if you just want to look at the source code, there are source jars for those binary versions in maven central. I was just looking at the metastore source code last night. Michael > On Jul 7, 2016, at 12:13 PM, Jonathan Kelly wrote: > > I'm not sure, but I think it's > https://github.com/

Re: SparkSession replace SQLContext

2016-07-05 Thread Michael Allman
These topics have been included in the documentation for recent builds of Spark 2.0. Michael > On Jul 5, 2016, at 3:49 AM, Romi Kuntsman wrote: > > You can also claim that there's a whole section of "Migrating from 1.6 to > 2.0" missing there: > https://spark.apache.org/docs/2.0.0-preview/sql

Can't build scala unidoc since Kafka 0.10 support was added

2016-07-02 Thread Michael Allman
Hello, I'm no longer able to successfully run `sbt unidoc` in branch-2.0, and the problem seems to stem from the addition of Kafka 0.10 support. If I remove either the Kafka 0.8 or 0.10 projects from the build then unidoc works. If I keep both in I get two dozen inexplicable compilation errors

Re: Bitmap Indexing to increase OLAP query performance

2016-06-30 Thread Michael Allman
Hi Nishadi, I have not seen bloom filters in Spark. They are mentioned as part of the Orc file format, but I don't know if Spark uses them: https://orc.apache.org/docs/spec-index.html. Parquet has block-level min/max values, null counts, etc for leaf columns in its metadata. I don't believe Sp

Re: Spark 2.0 Performance drop

2016-06-29 Thread Michael Allman
g benchmarking, we have a suite of Spark SQL regression tests which we run to check correctness and performance. I can share our findings when I have them. Cheers, Michael > On Jun 29, 2016, at 2:39 PM, Maciej Bryński wrote: > > 2016-06-29 23:22 GMT+02:00 Michael Allman : >> I

Re: Spark 2.0 Performance drop

2016-06-29 Thread Michael Allman
Hi Maciej, In Spark, projection pushdown is currently limited to top-level columns (StructFields). VideoAmp has very large parquet-based tables (many billions of records accumulated per day) with deeply nested schema (four or five levels), and we've spent a considerable amount of time optimizin
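[Editor's note: a small sketch of the limitation being described — the schema and path are illustrative. Because pushdown stops at StructField (top-level column) granularity, selecting a single nested leaf still reads its entire top-level column:]

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Imagine a deeply nested schema such as:
#   id: long, payload: struct<meta: struct<tag: string>, blob: binary>
df = spark.read.parquet("/data/events")  # illustrative path

# Only top-level columns are pruned: this reads all of `payload`
# from disk, not just payload.meta.tag.
df.select("payload.meta.tag").show()
```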

Re: Spark SQL PR looking for love...

2016-06-28 Thread Michael Allman
I should briefly mention what the PR is about... This is a patch to address a problem where non-empty partitioned Hive metastore tables are never returned in a cache lookup in HiveMetastoreCatalog.getCached. Thanks, Michael > On Jun 28, 2016, at 3:27 PM, Michael Allman wrote: >

Spark SQL PR looking for love...

2016-06-28 Thread Michael Allman
Hello, Do any Spark SQL committers/experts have bandwidth to review a PR I submitted a week ago, https://github.com/apache/spark/pull/13818 ? The associated Jira ticket is https://issues.apache.org/jira/browse/SPARK-15968

Spark SQL ExternalSorter not stopped

2015-03-19 Thread Michael Allman
I've examined the experimental support for ExternalSorter in Spark SQL, and it does not appear that the external sorter is ever stopped (ExternalSorter.stop). According to the API documentation, this suggests a resource leak. Before I file a bug report in Jira, can someone familiar with the code

Receiver/DStream storage level

2014-10-23 Thread Michael Allman
I'm implementing a custom ReceiverInputDStream and I'm not sure how to initialize the Receiver with the storage level. The storage level is set on the DStream, but there doesn't seem to be a way to pass it to the Receiver. At the same time, setting the storage level separately on the Receiver se

Re: reading/writing parquet decimal type

2014-10-23 Thread Michael Allman
ld fewer bytes than an int64, though many > encodings of int64 can probably do the right thing. We can look into > supporting multiple ways to do this -- the spec does say that you should at > least be able to read int32s and int64s. > > Matei > > On Oct 12, 2014, at 8:20

Re: reading/writing parquet decimal type

2014-10-12 Thread Michael Allman
res soon, but meanwhile you can try this branch. See > https://github.com/mateiz/spark/compare/decimal for the individual commits > that went into it. It has exactly the precision stuff you need, plus some > optimizations for working on decimals. > > Matei > > On Oct 12, 201

reading/writing parquet decimal type

2014-10-12 Thread Michael Allman
Hello, I'm interested in reading/writing parquet SchemaRDDs that support the Parquet Decimal converted type. The first thing I did was update the Spark parquet dependency to version 1.5.0, as this version introduced support for decimals in parquet. However, conversion between the catalyst decim