Possible bug in DatasourceV2

2018-10-10 Thread assaf.mendelson
Hi, I created a datasource writer WITHOUT a reader. When I do, I get an exception: org.apache.spark.sql.AnalysisException: Data source is not readable: DefaultSource. The reason for this is that when save is called, inside the match of the source against WriterSupport, we have the following code: val source =
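
A hedged reproduction of the failure, assuming a hypothetical V2 source com.example.WriteOnlySource that implements WriteSupport but not ReadSupport (only the calling side is sketched):

    // Hypothetical write-only DataSourceV2 source; spark is an active SparkSession.
    spark.range(10).write
      .format("com.example.WriteOnlySource") // WriteSupport only, no ReadSupport
      .mode("append")
      .save()
    // => org.apache.spark.sql.AnalysisException: Data source is not readable: DefaultSource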

Docker image to build Spark/Spark doc

2018-10-10 Thread assaf.mendelson
Hi all, I was wondering if there was a docker image to build spark and/or the spark documentation. The idea would be that I would start the docker image, supplying the directory with my code and a target directory, and it would simply build everything (maybe with some options). Any chance there is alre

DataSourceV2 documentation & tutorial

2018-10-08 Thread assaf.mendelson
Hi all, I have been working on a legacy datasource integration with data source V2 for the last couple of weeks, including upgrading it to the Spark 2.4.0 RC. During this process I wrote a tutorial with explanations on how to create a new datasource (it can be found in https://github.com/assafmendel

Re: Data source V2 in spark 2.4.0

2018-10-04 Thread assaf.mendelson
Thanks for the info. I have been converting an internal data source to V2 and am now preparing it for 2.4.0. I have a couple of suggestions from my experience so far. First I believe we are missing documentation on this. I am currently writing an internal tutorial based on what I am learning, I

Data source V2 in spark 2.4.0

2018-10-01 Thread assaf.mendelson
Hi all, I understood from previous threads that the Data source V2 API will see some changes in spark 2.4.0, however, I can't seem to find what these changes are. Is there some documentation which summarizes the changes? The only mention I seem to find is this pull request: https://github.com/apa

Re: Why is SQLImplicits an abstract class rather than a trait?

2018-08-06 Thread assaf.mendelson
The import will work for the trait but not for anyone implementing the trait. As for not having a master, it was just an example, the full example contains some configurations. Thanks, Assaf

Why is SQLImplicits an abstract class rather than a trait?

2018-08-05 Thread assaf.mendelson
Hi all, I have been playing a bit with SQLImplicits and noticed that it is an abstract class. I was wondering why that is? It has no constructor. Because it is an abstract class, a test trait cannot extend it and still remain a trait. Consider the following: trait MySp
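
A minimal sketch of the workaround Spark's own test suites use (the trait and object names here are illustrative): expose the implicits through an inner object instead of mixing SQLImplicits into the trait itself.

    import org.apache.spark.sql.{SQLContext, SQLImplicits, SparkSession}

    trait MySparkSession {
      lazy val spark: SparkSession =
        SparkSession.builder().master("local[2]").appName("test").getOrCreate()

      // An inner object sidesteps the abstract-class constraint on SQLImplicits
      object testImplicits extends SQLImplicits {
        protected override def _sqlContext: SQLContext = spark.sqlContext
      }
    }

The limitation raised in the reply above remains: each implementer still needs its own import testImplicits._, since an import inside the trait does not propagate.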

Data source V2

2018-07-30 Thread assaf.mendelson
Hi all, I am currently in the middle of developing a new data source (for an internal tool) using data source V2. I noticed that SPARK-24882 is planned for 2.4 and includes interface changes. I was wondering if those are planned in addition to

Re: Spark data source resiliency

2018-07-03 Thread assaf.mendelson
You are correct, this solved it. Thanks

Re: Spark data source resiliency

2018-07-02 Thread assaf.mendelson
That is what I expected, however, I did a very simple test (using println just to see when the exception is triggered in the iterator) using local master and I saw it fail once and cause the entire operation to fail. Is this something which may be unique to local master (or some default configur

Spark data source resiliency

2018-07-02 Thread assaf.mendelson
Hi All, I implemented a data source V2 which integrates with an internal system and I need to make it resilient to errors in the internal data source. The issue is that currently, if there is an exception in the data reader, the exception seems to fail the entire task. I would prefer instead t
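
A minimal sketch, assuming the resolution hinted at in the follow-ups above is task retries: a plain local master allows only one task failure, so a single reader exception fails the whole job. The local[N, F] master string runs N threads and allows up to F failures per task.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[4, 4]") // 4 threads, up to 4 failures per task before the job fails
      .appName("resiliency-test")
      .getOrCreate()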

Possible bug: inconsistent timestamp behavior

2017-08-15 Thread assaf.mendelson
Hi all, I encountered weird behavior for timestamp. It seems that when using lit to add it to a column, the timestamp goes from a milliseconds representation to a seconds representation: scala> spark.range(1).withColumn("a", lit(new java.sql.Timestamp(148550335L)).cast("long")).show() +---+-
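
A hedged reconstruction of the observation (the millisecond constant here is illustrative, not the truncated value from the post): java.sql.Timestamp is constructed from milliseconds, but casting the resulting timestamp column to long yields seconds since the epoch.

    import org.apache.spark.sql.functions.lit

    val ts = new java.sql.Timestamp(1485503350000L) // constructor takes milliseconds
    spark.range(1).withColumn("a", lit(ts).cast("long")).show()
    // column "a" shows 1485503350 -- seconds, not the millisecond value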

RE: [SS] Why does ConsoleSink's addBatch convert input DataFrame to show it?

2017-07-07 Thread assaf.mendelson
I actually asked the same thing a couple of weeks ago. Apparently, when you create a structured streaming plan, it is different from the batch plan and is fixed so it can aggregate properly. If you perform most operations on the dataframe it will recalculate the plan as a batch plan and will t

RE: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-26 Thread assaf.mendelson
Not a show stopper, however, I was looking at the structured streaming programming guide and under arbitrary stateful operations (https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-docs/structured-streaming-programming-guide.html#arbitrary-stateful-operations) the suggestion is t

cannot call explain or show on the dataframe in structured streaming addBatch

2017-06-19 Thread assaf.mendelson
Hi all, I am playing around with structured streaming and looked at the code for ConsoleSink. I see the code has: data.sparkSession.createDataFrame( data.sparkSession.sparkContext.parallelize(data.collect()), data.schema) .show(numRowsToShow, isTruncated) } I was wondering why it does
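
The snippet quoted above, reformatted for readability (data, numRowsToShow and isTruncated come from ConsoleSink's own scope): the streaming DataFrame is collected and rebuilt as a batch DataFrame before show() is called on it.

    data.sparkSession.createDataFrame(
        data.sparkSession.sparkContext.parallelize(data.collect()), data.schema)
      .show(numRowsToShow, isTruncated)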

RE: [PYTHON] PySpark typing hints

2017-05-23 Thread assaf.mendelson
Actually there is, at least for pycharm. I actually opened a jira on it (https://issues.apache.org/jira/browse/SPARK-17333). It describes two way of doing it (I also made a github stub at: https://github.com/assafmendelson/ExamplePysparkAnnotation). Unfortunately, I never found the time to foll

RE: Will .count() always trigger an evaluation of each row?

2017-02-19 Thread assaf.mendelson
less efficient. On 19 Feb 2017, at 10:13, assaf.mendelson <[hidden email]> wrote: Actually, when I did a simple test on parquet (spark.read.parquet(“somefile”).cache().count()) the UI showed me that the entire file is cached. Is this just a fluke? In any case I believe the question is

RE: Will .count() always trigger an evaluation of each row?

2017-02-19 Thread assaf.mendelson
Actually, when I did a simple test on parquet (spark.read.parquet("somefile").cache().count()) the UI showed me that the entire file is cached. Is this just a fluke? In any case I believe the question is still valid: how to make sure a dataframe is cached. Consider for example a case where we r
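
The pattern under discussion, spelled out ("somefile" is the path from the post): cache() only marks the plan for caching, so an action is needed to materialize it, and count() touches every row.

    val df = spark.read.parquet("somefile").cache() // lazy: nothing is cached yet
    df.count() // full scan; afterwards the Storage tab shows the data as cached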

spark support on windows

2017-01-16 Thread assaf.mendelson
Hi, In the documentation it says spark is supported on windows. The problem, however, is that the documentation description on windows is lacking. There are sources (such as https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html and man

RE: repeated unioning of dataframes takes worse than O(N^2) time

2016-12-29 Thread assaf.mendelson
recursive structure in a similar manner to described here http://stackoverflow.com/q/34461804 You can try something like this http://stackoverflow.com/a/37612978 but there is of course on overhead of conversion between Dataset and RDD. On 12/29/2016 06:21 PM, assaf.mendelson wrote: Hi, I have

repeated unioning of dataframes takes worse than O(N^2) time

2016-12-29 Thread assaf.mendelson
Hi, I have been playing around with doing a union between a large number of dataframes and saw that the performance of the actual union (not the action) is worse than O(N^2). Since a union basically defines a lineage (i.e. the current plan plus the other as a child) this should be almost instantane
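
A hedged sketch of the benchmark shape and a common mitigation (the sizes are illustrative): a left fold of unions produces a plan of depth N, whose analysis cost grows super-linearly, while a balanced reduce keeps the plan depth at O(log N).

    import org.apache.spark.sql.DataFrame

    val dfs: Seq[DataFrame] = (1 to 1000).map(_ => spark.range(10).toDF("id"))

    val naive = dfs.reduce(_ union _) // deep, unbalanced plan

    def balancedUnion(xs: Seq[DataFrame]): DataFrame =
      if (xs.size == 1) xs.head
      else balancedUnion(xs.grouped(2).map(_.reduce(_ union _)).toSeq)

    val balanced = balancedUnion(dfs) // plan depth O(log N)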

RE: Shuffle intermediate results not being cached

2016-12-27 Thread assaf.mendelson
us dataframes in each iteration. You just need to get the summary and union it with new dataframe to compute the newer aggregation summary in next iteration. It is more similar to streaming case, I don't think you can/should recompute all the data since the beginning of a stream. assaf.

RE: Shuffle intermediate results not being cached

2016-12-26 Thread assaf.mendelson
The reason I thought some operations would be reused is the fact that spark automatically caches shuffle data, which means the partial aggregation for previous dataframes would be saved. Unfortunately, as Mark Hamstra explained, this is not the case because this is considered a new RDD and therefore

Shuffle intermediate results not being cached

2016-12-26 Thread assaf.mendelson
Hi, Sorry to be bothering everyone on the holidays but I have found what may be a bug. I am doing a "manual" streaming (see http://stackoverflow.com/questions/41266956/apache-spark-streaming-performance for the specific code) where I essentially read an additional dataframe each time from fil
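
A hedged sketch of the workaround suggested in the replies above (firstBatch and newBatches are illustrative names): fold each new batch into a running summary instead of re-aggregating the whole history, since shuffle output is only reused for the exact same RDD, not for an equivalent new DataFrame.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.sum

    var summary: DataFrame = firstBatch.groupBy("key").agg(sum("value").as("value"))
    for (batch <- newBatches) {
      summary = summary
        .union(batch.groupBy("key").agg(sum("value").as("value")))
        .groupBy("key").agg(sum("value").as("value"))
      summary.cache().count() // materialize so the next step starts from the summary
    }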

RE: Aggregating over sorted data

2016-12-22 Thread assaf.mendelson
It seems that this aggregation is for dataset operations only. I would have hoped to be able to do dataframe aggregation. Something along the lines of: sort_df(df).agg(my_agg_func) In any case, note that this kind of sorting is less efficient than the sorting done in window functions for example

RE: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread assaf.mendelson
I may be mistaken but if I remember correctly spark behaves differently when it is bounded in the past and when it is not. Specifically I seem to recall a fix which made sure that when there is no lower bound then the aggregation is done one by one instead of doing the whole range for each windo
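
A hedged illustration of the two frame kinds being contrasted (column names are illustrative): a frame unbounded in the past can be maintained incrementally row by row, while a frame bounded in the past must be re-evaluated over its range for every row.

    import org.apache.spark.sql.expressions.Window

    val running = Window.partitionBy("k").orderBy("ts")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow) // incremental
    val sliding = Window.partitionBy("k").orderBy("ts")
      .rowsBetween(-10, Window.currentRow) // bounded in the past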

RE: Handling questions in the mailing lists

2016-11-24 Thread assaf.mendelson
it's still not technically critical mass after 2 years. It would just fracture the discussion to yet another place. On Thu, Nov 24, 2016 at 6:52 AM assaf.mendelson <[hidden email]> wrote: Sorry to reawaken this, but I just noticed it is possible to propose new topic specific sites (http://are

RE: Handling questions in the mailing lists

2016-11-23 Thread assaf.mendelson
to the /community.html page. (We're going to migrate the wiki real soon now anyway) I updated the /community.html page per this thread too. PR: https://github.com/apache/spark-website/pull/16 On Tue, Nov 15, 2016 at 2:49 PM assaf.mendelson <[hidden email]> wrote: Should probably al

Aggregating over sorted data

2016-11-23 Thread assaf.mendelson
Hi, An issue I have encountered frequently is the need to look at data in an ordered manner per key. A common way of doing this can be seen in the classic map reduce as the shuffle stage provides sorted data per key and one can therefore do a lot with that. It is of course relatively easy to achi
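
A minimal sketch of the map-reduce-style pattern described, in Dataset form (df, Event and the column names are illustrative): co-locate each key in a single partition, sort within partitions, then walk the rows in order.

    import spark.implicits._ // assumes an active SparkSession named spark

    case class Event(key: String, time: Long, value: Double)

    df.as[Event]
      .repartition($"key")
      .sortWithinPartitions($"key", $"time")
      .mapPartitions { rows =>
        // rows for each key arrive consecutively, ordered by time
        rows.map(e => (e.key, e.value))
      }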

RE: structured streaming and window functions

2016-11-17 Thread assaf.mendelson
each session, you want to count number of failed login events. If so, this is tracked by https://issues.apache.org/jira/browse/SPARK-10816 (didn't start yet) Ofir Manor Co-Founder & CTO | Equalum Mobile: +972-54-7801286 | Email: [hidden email] On Thu, Nov 17, 2016 at 2:52 PM, assaf.

RE: structured streaming and window functions

2016-11-17 Thread assaf.mendelson
sounds, because you need to maintain state in a fault tolerant way and you need to have some eviction policy (watermarks for instance) for aggregation buffers to prevent the state store from reaching an infinite size. On Thu, Nov 17, 2016 at 12:19 AM, assaf.mendelson <[hidden email]> wro

structured streaming and window functions

2016-11-17 Thread assaf.mendelson
Hi, I have been trying to figure out how structured streaming handles window functions efficiently. The portion I understand is that whenever new data arrives, it is grouped by the time and the aggregated data is added to the state. However, unlike operations like sum etc., window functions need t

RE: Handling questions in the mailing lists

2016-11-15 Thread assaf.mendelson
of other contributors besides just PMC/committers). On Wed, Nov 9, 2016 at 2:18 AM, assaf.mendelson <[hidden email]> wrote: I was just wondering, before we move on to SO. Do we have enough contributors with enough reputation do manage things in SO? We would need contributors with enough reputation

RE: separate spark and hive

2016-11-15 Thread assaf.mendelson
l on interoperability. If you don't need persistent catalog, you can just run Spark without Hive mode, can't you? On Mon, Nov 14, 2016 at 11:23 PM, assaf.mendelson <[hidden email]> wrote: Hi, Today, we basically force people to use hive if they want to get the full use of spark S

RE: separate spark and hive

2016-11-15 Thread assaf.mendelson
actually make sense to do, but it just seems a lot of work to do right now and it'd take a toll on interoperability. If you don't need persistent catalog, you can just run Spark without Hive mode, can't you? On Mon, Nov 14, 2016 at 11:23 PM, assaf.mendelson <[hidden email]&

separate spark and hive

2016-11-14 Thread assaf.mendelson
Hi, Today, we basically force people to use hive if they want to get the full use of spark SQL. When doing the default installation this means that a derby.log and metastore_db directory are created where we run from. The problem with this is that if we run multiple scripts from the same working
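
A hedged sketch of one way to avoid the per-working-directory clutter: pin the warehouse and the Derby-backed metastore to fixed locations (the paths here are illustrative).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
      .config("javax.jdo.option.ConnectionURL",
        "jdbc:derby:;databaseName=/tmp/metastore_db;create=true") // relocates metastore_db
      .getOrCreate()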

RE: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread assaf.mendelson
I am not sure I understand when the statistics would be calculated. Would they always be calculated or just when analyze is called? Would it be possible to save analysis results as part of dataframe saving (e.g. when writing it to parquet) or do we have to have a consistent hive installation? Wo
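
For reference, the analyze step these questions revolve around (t is a hypothetical table; the column-level variant is the one being added as part of the cost-based optimizer work):

    spark.sql("ANALYZE TABLE t COMPUTE STATISTICS")                        // table-level stats
    spark.sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS key, value") // column-level stats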

RE: [ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread assaf.mendelson
While you can download spark 2.0.2, the description is still spark 2.0.1: Our latest stable version is Apache Spark 2.0.1, released on Oct 3, 2016 (release notes) (git tag) From: rxin

RE: how does isDistinct work on expressions

2016-11-13 Thread assaf.mendelson
belongs to where. Are you creating a new UDAF? What have you done already? GitHub perhaps? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sun, Nov 13, 2016 at

Converting spark types and standard scala types

2016-11-13 Thread assaf.mendelson
Hi, I am trying to write a new aggregate function (https://issues.apache.org/jira/browse/SPARK-17691) and I wanted it to support all ordered types. I have several issues though: 1. How to convert the type of the child expression to a Scala standard type (e.g. I need an Array[Int] for Int
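
A hedged sketch of the conversion in question: inside an expression, input values arrive in Catalyst's internal representation rather than as Scala standard types, so they have to be unpacked explicitly.

    import org.apache.spark.sql.catalyst.util.ArrayData
    import org.apache.spark.unsafe.types.UTF8String

    def toScala(value: Any): Any = value match {
      case a: ArrayData  => a.toIntArray() // e.g. Array[Int] for an array<int> child
      case s: UTF8String => s.toString     // internal string -> java.lang.String
      case other         => other          // numeric primitives pass through as-is
    }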

how does isDistinct work on expressions

2016-11-13 Thread assaf.mendelson
Hi, I am trying to understand how aggregate functions are implemented internally. I see that the expression is wrapped using toAggregateExpression using isDistinct. I can't figure out where the code that makes the data distinct is located. I am trying to figure out how the input data is converted
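
The wrapping referred to above, sketched (myAggFunction stands for any Catalyst AggregateFunction): isDistinct is only a flag on the AggregateExpression wrapper, and the de-duplication itself is introduced later during planning (see the RewriteDistinctAggregates rule), which is why no "distinct" code appears in the function itself.

    val distinctAgg = myAggFunction.toAggregateExpression(isDistinct = true)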

RE: Handling questions in the mailing lists

2016-11-09 Thread assaf.mendelson
Community Mailing Lists / StackOverflow Changes<https://docs.google.com/document/d/1N0pKatcM15cqBPqFWCqIy6jdgNzIoacZlYDCjufBh2s/edit#heading=h.xshc1bv4sn3p> has been updated to include suggested tags. WDYT? On Tue, Nov 8, 2016 at 11:02 PM assaf.mendelson <[hidden email]> wrote: I like the

RE: Handling questions in the mailing lists

2016-11-08 Thread assaf.mendelson
stions on user@ of course, but a lot of the questions I see could >> have been answered with research of existing docs or looking at the code. I >> think that given the scale of the list, it's not wrong to assert that this >> is sort of a prerequisite for asking thousands of people to

RE: Handling questions in the mailing lists

2016-11-06 Thread assaf.mendelson
ong to >> ask questions on user@ of course, but a lot of the questions I see could >> have been answered with research of existing docs or looking at the code. I >> think that given the scale of the list, it's not wrong to assert that this >> is sort of a prerequisite for ask

Handling questions in the mailing lists

2016-11-02 Thread assaf.mendelson
Hi, I know this is a little off topic but I wanted to raise an issue about handling questions in the mailing list (this is true both for the user mailing list and the dev but since there are other options such as stack overflow for user questions, this is more problematic in dev). Let's say I as

RE: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread assaf.mendelson
-beyond-batch-streaming-101 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 Kind Regards 2016-10-27 9:46 GMT+03:00 assaf.mendelson <[hidden email]>: Hi, Should comments come here or in the JIRA? Any, I am a little confused on the need to expose this as an API to begin with.

RE: Watermarking in Structured Streaming to drop late data

2016-10-26 Thread assaf.mendelson
Hi, Should comments come here or in the JIRA? Anyway, I am a little confused on the need to expose this as an API to begin with. Let's consider for a second the most basic behavior: We have some input stream and we want to aggregate a sum over a time window. This means that the window we should be lo
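
For reference, the shape the API eventually took (a hedged example; events is a hypothetical streaming DataFrame with an eventTime column): the watermark declares how late data may arrive before window state is dropped.

    import org.apache.spark.sql.functions.{col, window}

    val counts = events
      .withWatermark("eventTime", "10 minutes")       // allowed lateness
      .groupBy(window(col("eventTime"), "5 minutes")) // tumbling event-time window
      .count()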

Using SPARK_WORKER_INSTANCES and SPARK-15781

2016-10-26 Thread assaf.mendelson
As of applying SPARK-15781, the documentation of SPARK_WORKER_INSTANCES has been removed. This was due to a warning in spark-submit which suggested: WARN SparkConf: SPARK_WORKER_INSTANCES was detected (set to '4'). This is deprecated in Spark 1.0+. Please instead use: - ./spark-submit with --num-

Converting spark types and standard scala types

2016-10-25 Thread assaf.mendelson
Hi, I am trying to write a new aggregate function (https://issues.apache.org/jira/browse/SPARK-17691) and I wanted it to support all ordered types. I have several issues though: 1. How to convert the type of the child expression to a Scala standard type (e.g. I need an Array[Int] for Int

RE: StructuredStreaming status

2016-10-20 Thread assaf.mendelson
My thoughts were of handling just the “current” state of the sliding window (i.e. the “last” window). The idea is that at least in cases which I encountered, the sliding window is used to “forget” irrelevant information and therefore when a step goes out of date for the “current” window it beco

RE: StructuredStreaming status

2016-10-19 Thread assaf.mendelson
There is one issue I was thinking of. If I understand correctly, structured streaming basically groups by a time bucket in the sliding window (the step). My problem is that in some cases (e.g. distinct count and any other case where the buffer is relatively large) this would mean copying the

RE: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-13 Thread assaf.mendelson
Hi, We are actually using pyspark heavily. I agree with all of your points, for me I see the following as the main hurdles: 1. Pyspark does not have support for UDAF. We have had multiple needs for UDAF and needed to go to java/scala to support these. Having python UDAF would have made l

RE: Official Stance on Not Using Spark Submit

2016-10-13 Thread assaf.mendelson
I actually do not use spark submit for several use cases, all of them currently revolve around running it directly with python. One of the most important ones is developing in pycharm. Basically I am using pycharm and configure it with a remote interpreter which runs on the server while my pych

RE: Spark Improvement Proposals

2016-10-09 Thread assaf.mendelson
I agree with most of what Cody said. Two things: First we can always have other people suggest SIPs but mark them as "unreviewed" and have committers basically move them forward. The problem is that writing a good document takes time. This way we can leverage non committers to do some of this wo

RE: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-08 Thread assaf.mendelson
I don’t really have much experience with large open source projects but I have some experience with having lots of issues with no one handling them. Automation proved a good solution in my experience, but one thing that I found which was really important is giving people a chance to say “don’t c

https://issues.apache.org/jira/browse/SPARK-17691

2016-09-27 Thread assaf.mendelson
Hi, I wanted to try to implement https://issues.apache.org/jira/browse/SPARK-17691. So I started by looking at the implementation of collect_list. My idea was to do the same as they do, but when adding a new element, if there are already more than the threshold, remove one instead. The problem with th
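
A hedged sketch of the idea in the post, written as a Dataset Aggregator rather than by modifying collect_list's Catalyst internals: stop growing the buffer once the threshold is reached.

    import org.apache.spark.sql.expressions.Aggregator
    import org.apache.spark.sql.{Encoder, Encoders}

    class BoundedCollect(limit: Int) extends Aggregator[Int, List[Int], Seq[Int]] {
      def zero: List[Int] = Nil
      def reduce(buf: List[Int], x: Int): List[Int] =
        if (buf.size < limit) x :: buf else buf  // drop new elements once full
      def merge(a: List[Int], b: List[Int]): List[Int] = (a ++ b).take(limit)
      def finish(buf: List[Int]): Seq[Int] = buf.reverse
      def bufferEncoder: Encoder[List[Int]] = Encoders.kryo[List[Int]]
      def outputEncoder: Encoder[Seq[Int]] = Encoders.kryo[Seq[Int]]
    }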

RE: Memory usage for spark types

2016-09-20 Thread assaf.mendelson
, Sep 18, 2016 at 9:06 AM, assaf.mendelson <[hidden email]> wrote: Hi, I am trying to understand how spark types are kept in memory and accessed. I tried to look at the code at the definition of MapType and ArrayType for example and I can’t seem to find the relevant code for its

Memory usage for spark types

2016-09-18 Thread assaf.mendelson
Hi, I am trying to understand how spark types are kept in memory and accessed. I tried to look at the code at the definition of MapType and ArrayType for example and I can't seem to find the relevant code for its actual implementation. I am trying to figure out how these two types are implemente
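
A hedged pointer at where the layout actually lives: the logical types (ArrayType, MapType) carry no storage themselves; the in-memory form is in the internal row classes (UnsafeArrayData, UnsafeMapData), reachable through the internal RDD.

    import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}

    val df = spark.sql("select array(1, 2, 3) as a, map('k', 1) as m")
    val row = df.queryExecution.toRdd.first() // InternalRow, not Row
    val arr: ArrayData = row.getArray(0)      // typically an UnsafeArrayData
    val map: MapData   = row.getMap(1)        // typically an UnsafeMapData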

RE: UDF and native functions performance

2016-09-12 Thread assaf.mendelson
code by using .debugCodegen(). // maropu On Mon, Sep 12, 2016 at 7:43 PM, assaf.mendelson <assaf.mendelson@...> wrote: I am trying to create UDFs with improved performance. So I decided to compare several ways of doing it. In general I created a dataframe using range wit

UDF and native functions performance

2016-09-12 Thread assaf.mendelson
I am trying to create UDFs with improved performance. So I decided to compare several ways of doing it. In general I created a dataframe using range with 50M elements, cached it and counted it to manifest it. I then implemented a simple predicate (x<10) in 4 different ways, counted the elements
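
A hedged reconstruction of two of the variants compared: the same x < 10 predicate as a Scala UDF versus as a native Column expression over the cached 50M-row range. The reply above points at .debugCodegen() (via import org.apache.spark.sql.execution.debug._) to inspect the generated code for each variant.

    import org.apache.spark.sql.functions.{col, udf}

    val df = spark.range(50000000L).toDF("x").cache()
    df.count() // materialize the cache

    val lessThan10 = udf((x: Long) => x < 10)
    df.filter(lessThan10(col("x"))).count() // UDF: opaque per-row function call
    df.filter(col("x") < 10).count()        // native expression: stays inside codegen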

Test fails when compiling spark with tests

2016-09-11 Thread assaf.mendelson
Hi, I am trying to set up a spark development environment. I forked the spark git project and cloned the fork. I then checked out branch-2.0 tag (which I assume is the released source code). I then compiled spark twice. The first using: mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests c

implement UDF/UDAF supporting whole stage codegen

2016-09-07 Thread assaf.mendelson
Hi, I want to write a UDF/UDAF which provides native processing performance. Currently, when creating a UDF/UDAF in a normal manner the performance is hit because it breaks optimizations. For a simple example I wanted to create a UDF which tests whether the value is smaller than 10. I tried some