Re: Performance regression for partitioned parquet data

2017-06-13 Thread Michael Allman
Hi Bertrand, I encourage you to create a ticket for this and submit a PR if you have time. Please add me as a listener, and I'll try to contribute/review. Michael > On Jun 6, 2017, at 5:18 AM, Bertrand Bossy wrote: > > Hi, > > since moving to spark 2.1 from

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-06-02 Thread Michael Allman
> > On Fri, Jun 2, 2017 at 1:32 AM Reynold Xin <r...@databricks.com> wrote: > Yea I don't see why this needs to be per table config. If the user wants to > configure it per table, can't they just declare the data type on a per table >

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-06-01 Thread Michael Allman
I would suggest that making timestamp type behavior configurable and persisted per-table could introduce some real confusion, e.g. in queries involving tables with different timestamp type semantics. I suggest starting with the assumption that timestamp type behavior is a per-session flag that
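A minimal sketch of the per-session alternative, in Scala; the conf key name here is hypothetical, invented purely to illustrate the proposal:

    // Hypothetical session-level flag (this key does not exist in Spark):
    // every table in the session is read with the same timestamp semantics,
    // so a single query can never mix per-table interpretations.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ts-semantics-sketch").getOrCreate()
    spark.conf.set("spark.sql.timestampSemantics", "TIMESTAMP_WITH_LOCAL_TIME_ZONE")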

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-25 Thread Michael Allman
PR is here: https://github.com/apache/spark/pull/18112 > On May 25, 2017, at 10:28 AM, Michael Allman <mich...@videoamp.com> wrote: > > Michael, > > If you haven't started cutting the new RC, I'm working on a docume

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-25 Thread Michael Allman
Michael, If you haven't started cutting the new RC, I'm working on a documentation PR right now that I'm hoping we can get into Spark 2.2 as a migration note, even if it's just a mention: https://issues.apache.org/jira/browse/SPARK-20888 . Michael

Re: Parquet vectorized reader DELTA_BYTE_ARRAY

2017-05-22 Thread Michael Allman
Hi AndreiL, Were these files written with the Parquet V2 writer? The Spark 2.1 vectorized reader does not appear to support that format. Michael > On May 9, 2017, at 11:04 AM, andreiL wrote: > > Hi, I am getting an exception in Spark 2.1 reading parquet files where

Re: Method for gracefully terminating a driver on a standalone master in Spark 2.1+

2017-05-22 Thread Michael Allman
As I cannot find a way to gracefully kill an app which takes longer than 10 seconds to shut down, I have reported this issue as a bug: https://issues.apache.org/jira/browse/SPARK-20843 Michael > On May 4, 2017, at 4:15 PM, Micha

Method for gracefully terminating a driver on a standalone master in Spark 2.1+

2017-05-04 Thread Michael Allman
Hello, In performing our prod cluster upgrade, we've noticed that the behavior for killing a driver is more aggressive. Whereas pre-2.1 the driver runner would only call `Process.destroy`, in 2.1+ it now calls `Process.destroyForcibly` (on Java 8) if the previous `destroy` call does not return
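For reference, a minimal sketch of the JDK 8 Process API behavior at issue, with a 10-second grace period like the one described above:

    import java.util.concurrent.TimeUnit

    val proc = new ProcessBuilder("sleep", "60").start()
    proc.destroy()             // polite termination request (SIGTERM on POSIX)
    if (!proc.waitFor(10, TimeUnit.SECONDS)) {
      proc.destroyForcibly()   // SIGKILL; the child's shutdown hooks never run
    }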

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Michael Allman
It is related to the > SQL conf key "spark.sql.hive.caseSensitiveInferenceMode". This key and its > related functionality were absent from our previous build. The default setting > in the current build was causing Spark to attempt to scan all table files > durin

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-20 Thread Michael Allman
all table files during query analysis. Changing this setting to NEVER_INFER disabled this operation and resolved the issue we had. Michael > On Apr 20, 2017, at 3:42 PM, Michael Allman <mich...@videoamp.com> wrote: > > I want to caution that in testing a build from this morning
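For anyone hitting the same problem, a sketch of applying that workaround (assumes a Spark build that has this conf key and a SparkSession named spark):

    // Disable schema inference scans of table files during query analysis:
    spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
    // or equivalently at submit time:
    //   spark-submit --conf spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER ...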

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-20 Thread Michael Allman
I want to caution that in testing a build from this morning's branch-2.1 we found that Hive partition pruning was not working. We found that Spark SQL was fetching all Hive table partitions for a very simple query whereas in a build from several weeks ago it was fetching only the required

Re: Implementation of RNN/LSTM in Spark

2017-02-28 Thread Michael Allman
Hi Yuhao, BigDL looks very promising and it's a framework we're considering using. It seems the general approach to high-performance DL is via GPUs. Your project mentions performance on a Xeon comparable to that of a GPU, but where does this claim come from? Can you provide benchmarks?

Simple bug fix PR looking for love

2017-02-09 Thread Michael Allman
Hi Guys, Can someone help move https://github.com/apache/spark/pull/16499 along in the review process? This PR fixes replicated off-heap storage. Thanks! Michael

Re: Unique Partition Id per partition

2017-01-31 Thread Michael Allman
Hi Sumit, Can you use http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=rdd#pyspark.RDD.mapPartitionsWithIndex to solve your problem? Michael > On Jan 31, 2017,
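The Scala equivalent, as a minimal sketch (assumes a SparkContext named sc):

    // mapPartitionsWithIndex hands each partition its index, which can serve
    // as a unique partition id.
    val rdd = sc.parallelize(1 to 100, numSlices = 4)
    val tagged = rdd.mapPartitionsWithIndex { (partitionId, iter) =>
      iter.map(x => (partitionId, x))
    }
    tagged.take(3).foreach(println) // e.g. (0,1), (0,2), (0,3)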

Re: Error Saving Dataframe to Hive with Spark 2.0.0

2017-01-31 Thread Michael Allman
That's understandable. Maybe I can help. :) What happens if you set `HIVE_TABLE_NAME = "default.employees"`? Also, does that table exist before you call `filtered_output_timestamp.write.mode("append").saveAsTable(HIVE_TABLE_NAME)`? Cheers, Michael > On Jan 29, 2017, at 9:52 PM, Chetan Khatri
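In code, the suggestion amounts to something like this sketch (the DataFrame name is from the thread; whether the table pre-exists is the question being asked):

    // Qualify the table name with its database, then append:
    val HIVE_TABLE_NAME = "default.employees"
    filtered_output_timestamp.write.mode("append").saveAsTable(HIVE_TABLE_NAME)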

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-01-23 Thread Michael Allman
Hi Stan, What OS/version are you using? Michael > On Jan 22, 2017, at 11:36 PM, StanZhai wrote: > > I'm using Parallel GC. > rxin wrote >> Are you using G1 GC? G1 sometimes uses a lot more memory than the size >> allocated. >> >> >> On Sun, Jan 22, 2017 at 12:58 AM

Re: GraphX-related "open" issues

2017-01-20 Thread Michael Allman
> // maropu > > On Fri, Jan 20, 2017 at 1:27 PM, Michael Allman <mich...@videoamp.com> wrote: > That sounds fine to me. I think that in closing the issues, we should mention > that we're closing them because these algorith

Re: GraphX-related "open" issues

2017-01-19 Thread Michael Allman
SPARK-7257: Find nearest neighbor satisfying predicate > <https://issues.apache.org/jira/browse/SPARK-7257> > - SPARK-8497: Graph Clique (Complete Connected Sub-graph) Discovery Algorithm > <https://issues.apache.org/jira/browse/SPARK-8497> > > Best, > Dongjin > > On Fri,

Re: GraphX-related "open" issues

2017-01-19 Thread Michael Allman
Regarding new GraphX algorithms, I am in agreement with the idea of publishing algorithms which are implemented using the existing API as outside packages. Regarding SPARK-10335, we have a PR for SPARK-5484 which should address the problem described in that ticket. I've reviewed that PR, but

Re: [Spark SQL] Making InferSchema and JacksonParser public

2017-01-18 Thread Michael Allman
Personally I'd love to see some kind of pluggability and configurability in the JSON schema parsing, maybe as an option on the DataFrameReader. Perhaps you can propose an API? > On Jan 18, 2017, at 5:51 AM, Brian Hong wrote: > > I work for a mobile game company. I'm
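One hypothetical shape such an API could take; none of these option names exist, they are purely a sketch for discussion:

    // Invented options illustrating reader-level control of JSON schema inference:
    val df = spark.read
      .option("schemaInferenceMode", "strict")  // hypothetical
      .option("schemaSampleSize", "10000")      // hypothetical
      .json("/data/events.json")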

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-18 Thread Michael Allman
Hive Metastore. > > > On Wed, Jan 18, 2017 at 2:09 PM, Michael Allman <mich...@videoamp.com> wrote: > I think I understand. Partition pruning for the case where > spark.sql.hive.convertMetastoreParquet is true was not added to Spark unti

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-17 Thread Michael Allman
18, 2017 at 12:52 PM, Michael Allman <mich...@videoamp.com> wrote: > What version of Spark are you running? >> On Jan 17, 2017, at 8:42 PM, Raju Bairishetti <r...@apache.org> wrote:

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-17 Thread Michael Allman
ion: rajub.dummy[] InputPaths: > maprfs:/user/rajub/dummy/sample/year=2016/month=10, > maprfs:/user/rajub/dummy/sample/year=2016/month=11, > maprfs:/user/rajub/dummy/sample/year=2016/month=9, > maprfs:/user/rajub/dummy/sample/year=2017/month=10, > maprfs:/user/rajub/dummy/samp

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-17 Thread Michael Allman
Can you paste the actual query plan here, please? > On Jan 17, 2017, at 7:38 PM, Raju Bairishetti <r...@apache.org> wrote: > > > On Wed, Jan 18, 2017 at 11:13 AM, Michael Allman <mich...@videoamp.com> wrote: > What is t

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-17 Thread Michael Allman
What is the physical query plan after you set spark.sql.hive.convertMetastoreParquet to true? Michael > On Jan 17, 2017, at 6:51 PM, Raju Bairishetti <r...@apache.org> wrote: > > Thanks Michael for the response. > > > On Wed, Jan 18, 2017 at 2:45 AM, Michael Allm

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-17 Thread Michael Allman
Hi Raju, I'm sorry this isn't working for you. I helped author this functionality and will try my best to help. First, I'm curious why you set spark.sql.hive.convertMetastoreParquet to false. Can you link specifically to the Jira issue or Spark PR you referred to? The first thing I would try
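A quick way to compare the two code paths under discussion (table name taken from this thread; assumes a SparkSession named spark):

    spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
    spark.sql("SELECT * FROM rajub.dummy WHERE year = 2016 AND month = 10").explain()
    // With pruning working, the physical plan should list only the matching
    // partition paths rather than every partition of the table.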

Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-08 Thread Michael Allman
I believe https://github.com/apache/spark/pull/16122 needs to be included in Spark 2.1. It's a simple bug fix to some functionality introduced in 2.1. Unfortunately, it has only been verified manually. There's no unit test that covers it, and

Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-29 Thread Michael Allman
Nov 29, 2016, at 5:15 PM, Michael Allman <mich...@videoamp.com> wrote: > > Hello, > > When I try to read from a Hive table created by Spark 2.1 in Spark 2.0 or > earlier, I get an error: > > java.lang.ClassNotFoundException: Failed to load class for data source: h

Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-29 Thread Michael Allman
Hello, When I try to read from a Hive table created by Spark 2.1 in Spark 2.0 or earlier, I get an error: java.lang.ClassNotFoundException: Failed to load class for data source: hive. Is there a way to get previous versions of Spark to read tables written with Spark 2.1? Cheers, Michael

Jackson Spark/app incompatibility and how to resolve it

2016-11-17 Thread Michael Allman
Hello, I'm running into an issue with a Spark app I'm building: it depends on a library that requires Jackson 2.8, and it fails at runtime because Spark brings in Jackson 2.6. I'm looking for a solution. As a workaround, I've patched our build of Spark to use Jackson 2.8. That's working,
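One alternative to patching Spark itself is shading Jackson inside the application jar. A build.sbt sketch using the sbt-assembly plugin (the rename target is arbitrary):

    // Relocate the app's Jackson 2.8 classes so they cannot collide with
    // the Jackson 2.6 classes on Spark's classpath:
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.fasterxml.jackson.**" -> "shadedjackson.@1").inAll
    )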

Re: Updating Parquet dep to 1.9

2016-11-02 Thread Michael Allman
jira/browse/SPARK-18140> > > I think it's fine to pursue an upgrade to fix these several issues. The > question is just how well it will play with other components, so bears some > testing and evaluation of the changes from 1.8, but yes this would be good. > > On Mon, Oct 31,

Updating Parquet dep to 1.9

2016-10-31 Thread Michael Allman
Hi All, Is anyone working on updating Spark's Parquet library dep to 1.9? If not, I can at least get started on it and publish a PR. Cheers, Michael

Re: help from other committers on getting started

2016-09-02 Thread Michael Allman
Hi Dayne, Have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark . I think you'll find answers to most of your questions there. Cheers, Michael > On Sep 2, 2016, at 8:53 AM, Dayne

Re: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-24 Thread Michael Allman
FYI, I've updated the issue's description to include a very simple program which reproduces the issue for me. Cheers, Michael > On Aug 23, 2016, at 4:54 PM, Michael Allman <mich...@videoamp.com> wrote: > > I've replied on the issue's page, but in a word,

Re: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-23 Thread Michael Allman
problem still exist on today's master/branch-2.0? > > SPARK-16550 was merged. It might be fixed already. > > On Tue, Aug 23, 2016 at 9:37 AM, Michael Allman <mich...@videoamp.com> wrote: > FYI, I posted this to user@ and have followed u

Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Michael Allman
jpmml-sparkml <https://github.com/jpmml/jpmml-sparkml> - which is by now actually quite >>> comprehensive. It shows that PMML can represent a pretty large subset of >>> typical ML pipeline functionality. >>> >>> On the Python side sadly there

Re: Serving Spark ML models via a regular Python web app

2016-08-10 Thread Michael Allman
Nick, Check out MLeap: https://github.com/TrueCar/mleap . It's not python, but we use it in production to serve a random forest model trained by a Spark ML pipeline. Thanks, Michael > On Aug 10, 2016, at 7:50 PM, Nicholas Chammas

Re: Scaling partitioned Hive table support

2016-08-09 Thread Michael Allman
execution to defer pruning. > > On Mon, Aug 8, 2016, 11:53 AM Michael Allman <mich...@videoamp.com> wrote: > Hello, > > I'd like to propose a modification in the way Hive table partition metadata > are loaded and

Re: Scaling partitioned Hive table support

2016-08-08 Thread Michael Allman
er an (unconverted) hive table. > > You might also want to look at https://github.com/apache/spark/pull/14241 , which refactors some of the > file scan execution to defer pruning. > > On Mon, Aug 8, 2016, 11:53 AM M

Scaling partitioned Hive table support

2016-08-08 Thread Michael Allman
Hello, I'd like to propose a modification in the way Hive table partition metadata are loaded and cached. Currently, when a user reads from a partitioned Hive table whose metadata are not cached (and for which Hive table conversion is enabled and supported), all partition metadata are fetched

Re: Build speed

2016-07-22 Thread Michael Allman
I use sbt. Rebuilds are super fast. Michael > On Jul 22, 2016, at 7:54 AM, Mikael Ståldal wrote: > > Is there any way to speed up an incremental build of Spark? > > For me it takes 8 minutes to build the project with just a few code changes. > > -- > > > Mikael
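For what it's worth, the incremental flow looks like this (standard Spark sbt launcher; the module name is just an example):

    $ build/sbt        # keep one long-running sbt session open
    > project sql      # scope to the module you're changing
    > ~compile         # recompile automatically on every file save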

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Michael Allman
I've run some tests with some real and some synthetic Parquet data with nested columns, with and without the Hive metastore, on our Spark 1.5, 1.6 and 2.0 versions. I haven't seen any performance surprises, except that Spark 2.0 now does schema inference across all files in a

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Michael Allman
Marcin, I'm not sure what you're referring to. Can you be more specific? Cheers, Michael > On Jul 20, 2016, at 9:10 AM, Marcin Tustin wrote: > > Whatever happened with the query regarding benchmarks? Is that resolved? > > On Tue, Jul 19, 2016 at 10:35 PM, Reynold Xin

Re: transtition SQLContext to SparkSession

2016-07-19 Thread Michael Allman
> In order for SparkSession.sqlContext to return an actual HiveContext, we'd > need to use reflection to create a HiveContext, which is pretty hacky. > > > > On Tue, Jul 19, 2016 at 10:58 AM, Michael Allman <mich...@videoamp.com>

Re: transtition SQLContext to SparkSession

2016-07-19 Thread Michael Allman
Sorry Reynold, I want to triple check this with you. I'm looking at the `SparkSession.sqlContext` field in the latest 2.0 branch, and it appears that that val is set specifically to an instance of the `SQLContext` class. A cast to `HiveContext` will fail. Maybe there's a misunderstanding here.
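Concretely, the concern is this (a sketch against the 2.0 branch as described, given a SparkSession named sparkSession; it compiles, but the cast would fail at runtime):

    val ctx: org.apache.spark.sql.SQLContext = sparkSession.sqlContext
    // If sqlContext is constructed specifically as a SQLContext instance,
    // this throws ClassCastException even with Hive support enabled:
    val hiveCtx = ctx.asInstanceOf[org.apache.spark.sql.hive.HiveContext]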

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Michael Allman
ot;24g") .set("spark.kryoserializer.buffer.max","1g") .set("spark.sql.codegen.wholeStage", "true") .set("spark.memory.offHeap.enabled", "true") .set("spark.memory.offHeap.size", "25769803776") // 24 GB S

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Michael Allman
t, is running Spark standalone mode, we run > and prep Spark tests only, driver 8g, executor memory 16g, Kryo, 0.66 memory > fraction, 100 trials > > We can post the 1.6.2 comparison early next week, running lots of iterations > over the weekend once we get the dedicated time agai

Spark performance regression test suite

2016-07-08 Thread Michael Allman
Hello, I've seen a few messages on the mailing list regarding Spark performance concerns, especially regressions from previous versions. It got me thinking that an automated performance regression suite would be a worthwhile contribution. Is anyone working on this? Do we have a Jira

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Michael Allman
Hi Adam, Do you have your Spark confs and your spark-env.sh somewhere we can see them? If not, can you make them available? Cheers, Michael > On Jul 8, 2016, at 3:17 AM, Adam Roberts wrote: > > Hi, we've been testing the performance of Spark 2.0 compared to

Re: Anyone knows the hive repo for spark-2.0?

2016-07-07 Thread Michael Allman
FYI, if you just want to look at the source code, there are source jars for those binary versions in Maven Central. I was just looking at the metastore source code last night. Michael > On Jul 7, 2016, at 12:13 PM, Jonathan Kelly wrote: > > I'm not sure, but I think

Re: SparkSession replace SQLContext

2016-07-05 Thread Michael Allman
These topics have been included in the documentation for recent builds of Spark 2.0. Michael > On Jul 5, 2016, at 3:49 AM, Romi Kuntsman wrote: > > You can also claim that there's a whole section of "Migrating from 1.6 to > 2.0" missing there: >

Can't build scala unidoc since Kafka 0.10 support was added

2016-07-02 Thread Michael Allman
Hello, I'm no longer able to successfully run `sbt unidoc` in branch-2.0, and the problem seems to stem from the addition of Kafka 0.10 support. If I remove either the Kafka 0.8 or the Kafka 0.10 project from the build, unidoc works. If I keep both in, I get two dozen inexplicable compilation errors

Re: Bitmap Indexing to increase OLAP query performance

2016-06-30 Thread Michael Allman
Hi Nishadi, I have not seen bloom filters in Spark. They are mentioned as part of the ORC file format, but I don't know if Spark uses them: https://orc.apache.org/docs/spec-index.html. Parquet has block-level min/max values, null counts, etc. for leaf columns in its metadata. I don't believe
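A sketch of inspecting those Parquet column statistics directly with parquet-mr (API as of parquet-hadoop 1.8.x; the file path is illustrative):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader
    import scala.collection.JavaConverters._

    val footer = ParquetFileReader.readFooter(
      new Configuration(), new Path("/data/table/part-00000.parquet"))
    for (block <- footer.getBlocks.asScala; col <- block.getColumns.asScala) {
      // Statistics carry min/max values and null counts per column chunk.
      println(s"${col.getPath}: ${col.getStatistics}")
    }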

Re: Spark 2.0 Performance drop

2016-06-29 Thread Michael Allman
g, we have a suite of Spark SQL regression tests which we run to check correctness and performance. I can share our findings when I have them. Cheers, Michael > On Jun 29, 2016, at 2:39 PM, Maciej Bryński <mac...@brynski.pl> wrote: > > 2016-06-29 23:22 GMT+02:00 Michael Allman <m

Re: Spark 2.0 Performance drop

2016-06-29 Thread Michael Allman
Hi Maciej, In Spark, projection pushdown is currently limited to top-level columns (StructFields). VideoAmp has very large Parquet-based tables (many billions of records accumulated per day) with deeply nested schemas (four or five levels), and we've spent a considerable amount of time
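To make the limitation concrete, a small sketch with a hypothetical schema (assumes a spark-shell style SparkSession named spark); selecting a single nested field still reads the entire top-level struct from Parquet:

    case class Meta(id: Long, source: String)
    case class Event(meta: Meta, payload: String)

    import spark.implicits._
    val df = Seq(Event(Meta(1L, "web"), "...")).toDF()
    df.select($"meta.id").explain()
    // The scan's ReadSchema shows the full meta struct, not just meta.id.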

Re: Spark SQL PR looking for love...

2016-06-28 Thread Michael Allman
I should briefly mention what the PR is about... This is a patch to address a problem where non-empty partitioned Hive metastore tables are never returned in a cache lookup in HiveMetastoreCatalog.getCached. Thanks, Michael > On Jun 28, 2016, at 3:27 PM, Michael Allman <mich...@videoa

Spark SQL PR looking for love...

2016-06-28 Thread Michael Allman
Hello, Do any Spark SQL committers/experts have bandwidth to review a PR I submitted a week ago, https://github.com/apache/spark/pull/13818 ? The associated Jira ticket is https://issues.apache.org/jira/browse/SPARK-15968

Re: reading/writing parquet decimal type

2014-10-23 Thread Michael Allman
can hold fewer bytes than an int64, though many encodings of int64 can probably do the right thing. We can look into supporting multiple ways to do this -- the spec does say that you should at least be able to read int32s and int64s. Matei On Oct 12, 2014, at 8:20 PM, Michael Allman mich

Receiver/DStream storage level

2014-10-23 Thread Michael Allman
I'm implementing a custom ReceiverInputDStream and I'm not sure how to initialize the Receiver with the storage level. The storage level is set on the DStream, but there doesn't seem to be a way to pass it to the Receiver. At the same time, setting the storage level separately on the Receiver
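For reference, one way to wire the level through, since Receiver takes a StorageLevel in its constructor (class names are illustrative):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.ReceiverInputDStream
    import org.apache.spark.streaming.receiver.Receiver

    class MyReceiver(level: StorageLevel) extends Receiver[String](level) {
      def onStart(): Unit = { /* start worker threads that call store(...) */ }
      def onStop(): Unit = { /* stop them */ }
    }

    class MyInputDStream(ssc: StreamingContext, level: StorageLevel)
        extends ReceiverInputDStream[String](ssc) {
      def getReceiver(): Receiver[String] = new MyReceiver(level)
    }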

reading/writing parquet decimal type

2014-10-12 Thread Michael Allman
Hello, I'm interested in reading/writing Parquet SchemaRDDs that support the Parquet Decimal converted type. The first thing I did was update the Spark Parquet dependency to version 1.5.0, as this version introduced support for decimals in Parquet. However, conversion between the catalyst

Re: reading/writing parquet decimal type

2014-10-12 Thread Michael Allman
you can try this branch. See https://github.com/mateiz/spark/compare/decimal for the individual commits that went into it. It has exactly the precision stuff you need, plus some optimizations for working on decimals. Matei On Oct 12, 2014, at 1:51 PM, Michael Allman mich...@videoamp.com