Re: Dataframe Partitioning

2016-03-01 Thread yash datta
+1 This is one of the most common problems we encounter in our flow. Mark, I am happy to help if you would like to share some of the workload. Best, Yash

HashedRelation Memory Pressure on Broadcast Joins

2016-03-01 Thread Matt Cheah
Hi everyone, I had a quick question regarding our implementation of UnsafeHashedRelation and HashedRelation. It appears that we copy the rows that we’ve collected into
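
As context, the HashedRelation in question is the structure that backs broadcast hash joins. A minimal sketch of triggering one from the 1.6-era DataFrame API; the tables here are made up for the example:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.broadcast
    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastJoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("broadcast-join-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Hypothetical tables: a large fact table and a small dimension table.
        val large = sc.parallelize(1 to 1000000).map(i => (i % 100, i)).toDF("key", "value")
        val small = sc.parallelize(0 until 100).map(i => (i, s"dim-$i")).toDF("key", "name")

        // broadcast() asks the planner for a broadcast hash join: `small` is
        // collected to the driver, built into a HashedRelation, and shipped
        // to every executor -- the structure whose row copies the thread
        // discusses.
        large.join(broadcast(small), "key").show()

        sc.stop()
      }
    }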

Re: Dataframe Partitioning

2016-03-01 Thread Mark Hamstra
I don't entirely agree. You're best off picking the right size :). That's almost impossible, though, since at the input end of the query processing you often want a large number of partitions to get sufficient parallelism, both for performance and to avoid spilling or OOM, while at the output end
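
A minimal sketch of the trade-off being described, with hypothetical numbers: widen at the input end for parallelism, then narrow at the output end so the job doesn't write hundreds of small files:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionSizingSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("partition-sizing-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Hypothetical narrow input: only 2 partitions.
        val input = sc.parallelize(1 to 1000000, 2).toDF("value")

        // Input end: widen so each task holds less data -- more parallelism,
        // less risk of spilling or OOM during the heavy middle stages.
        val processed = input.repartition(200).filter($"value" % 2 === 0)

        // Output end: narrow again before writing; coalesce() avoids
        // another shuffle and keeps the file count small.
        processed.coalesce(4).write.parquet("/tmp/partition-sizing-output")

        sc.stop()
      }
    }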

Dataframe Partitioning

2016-03-01 Thread Teng Liao
Hi, I was wondering what the rationale is behind defaulting all repartitioning to spark.sql.shuffle.partitions. I'm seeing a huge overhead when running a job whose input has 2 partitions: with the default value of spark.sql.shuffle.partitions, every shuffle now produces 200. Thanks. -Teng Fei Liao
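
For what it's worth, the default can be overridden per application; a minimal sketch against the 1.6-era SQLContext (the value 2 just mirrors the input partition count from the question):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}

    object ShufflePartitionsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("shuffle-partitions-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Match the shuffle fan-out to the small input instead of the
        // default of 200.
        sqlContext.setConf("spark.sql.shuffle.partitions", "2")

        val df = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2).toDF("key", "value")
        // The shuffle behind this groupBy now produces 2 partitions.
        df.groupBy("key").count().show()

        sc.stop()
      }
    }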

Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-01 Thread Jerry Lam
Hi Reynold, You are right, it is about the audience. For instance, in many of my cases, the SQL style is very attractive, if not mandatory, for people with minimal programming knowledge. SQL has its place for communication. Last time I showed someone the Spark dataframe style, they immediately said it is

Re: [Proposal] Enabling time series analysis on spark metrics

2016-03-01 Thread Reynold Xin
Is the suggestion just to use a different config (and maybe fall back to the app ID) in order to publish metrics? Seems reasonable.

Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-01 Thread Reynold Xin
There are definitely pros and cons for Scala vs SQL-style CEP. Scala might be more powerful, but the target audience is very different. How much usage is there for a CEP-style SQL syntax in practice? I've never seen it come up so far.

Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-01 Thread Jerry Lam
Hi Henri, Finally, there is a good reason for me to use Flink! Thanks for sharing this information. This is exactly the solution I'm looking for, especially since the ticket references a paper I was reading a week ago. It would be nice if Flink added support for SQL, because this makes business analysts

Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-01 Thread Henri Dubois-Ferriere
fwiw Apache Flink just added CEP. Queries are constructed programmatically rather than in SQL, but the underlying functionality is similar. https://issues.apache.org/jira/browse/FLINK-3215

Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-01 Thread Jerry Lam
Hi Herman, Thank you for your reply! This functionality usually finds its place in financial services, which use CEP (complex event processing) for correlation and pattern matching. Many commercial products have this, including Oracle and Teradata Aster Data MR Analytics. I do agree the syntax

Re: [Proposal] Enabling time series analysis on spark metrics

2016-03-01 Thread Karan Kumar
+dev mailing list. Time series analysis on metrics becomes quite useful when running Spark jobs using a workflow manager like Oozie. Would love to take this up if the community thinks it's worthwhile.

Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-01 Thread Herman van Hövell tot Westerflier
Hi Jerry, This is not on any roadmap. I (briefly) browsed through this; it looks like some sort of window function with very awkward syntax. I think Spark provides better constructs for this using dataframes/datasets/nested data... Feel free to submit a PR. Kind regards, Herman van Hövell tot Westerflier
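
A rough illustration of the window-function shape Herman is pointing at: flagging rows where a price dipped relative to the previous tick, per symbol, with lag(). The tick data and column names are invented, and on Spark 1.6 window functions need a HiveContext:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.{SparkConf, SparkContext}

    object RowPatternSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("row-pattern-sketch").setMaster("local[*]"))
        val sqlContext = new HiveContext(sc) // 1.6 needs Hive support for window functions
        import sqlContext.implicits._

        // Hypothetical tick data: (symbol, time, price).
        val ticks = Seq(
          ("ACME", 1, 12.0), ("ACME", 2, 11.0), ("ACME", 3, 13.0),
          ("EMCA", 1, 10.0), ("EMCA", 2, 10.5), ("EMCA", 3, 10.7)
        ).toDF("symbol", "time", "price")

        // Compare each row with its predecessor inside a per-symbol window --
        // a crude stand-in for MATCH_RECOGNIZE-style row pattern matching.
        val w = Window.partitionBy("symbol").orderBy("time")
        val withPrev = ticks.withColumn("prev", lag("price", 1).over(w))

        // Rows where the price fell relative to the previous tick.
        withPrev.filter($"prev".isNotNull && $"price" < $"prev").show()

        sc.stop()
      }
    }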

SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-01 Thread Jerry Lam
Hi Spark developers, Would you consider adding support for implementing "pattern matching in sequences of rows"? More specifically, I'm referring to this: http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf This is a very cool/useful feature to pattern

Re: What should be spark.local.dir in spark on yarn?

2016-03-01 Thread Jeff Zhang
You are using yarn-client mode; the driver is not a YARN container, so it cannot use yarn.nodemanager.local-dirs and has to use spark.local.dir, which is /tmp by default. But the driver usually won't use much disk, so it should be fine to use /tmp on the driver side.
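
A minimal sketch of moving the driver's scratch space off /tmp in yarn-client mode (the path is an assumption; executors inside YARN containers keep using yarn.nodemanager.local-dirs):

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalDirSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("local-dir-sketch")
          // Only affects the driver here: its block-manager and shuffle
          // scratch files land under this directory instead of /tmp.
          .set("spark.local.dir", "/data01/spark-tmp") // hypothetical path
        val sc = new SparkContext(conf)
        // ... job ...
        sc.stop()
      }
    }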

Re: Is spark.driver.maxResultSize used correctly ?

2016-03-01 Thread Jeff Zhang
Checked the code again. It looks like currently the task result is loaded into memory whether it is a DirectTaskResult or an IndirectTaskResult. Previously I thought the IndirectTaskResult could be loaded into memory later, which could save memory; an RDD#collectAsIterator is what I thought might save memory.
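
There is no RDD#collectAsIterator in the API, but RDD#toLocalIterator (which does exist) is close in spirit; a minimal sketch, with a made-up job, of bounding driver memory by fetching one partition at a time:

    import org.apache.spark.{SparkConf, SparkContext}

    object IteratorCollectSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("iterator-collect-sketch").setMaster("local[*]"))

        val rdd = sc.parallelize(1 to 1000000, 100)

        // toLocalIterator runs one job per partition and keeps only a single
        // partition's results on the driver at a time, so peak driver memory
        // is bounded by the largest partition, not the whole result.
        rdd.toLocalIterator.take(5).foreach(println)

        sc.stop()
      }
    }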

Re: Is spark.driver.maxResultSize used correctly ?

2016-03-01 Thread Reynold Xin
How big of a deal is this, though? If I am reading your email correctly, either way this job will fail. You simply want it to fail earlier on the executor side, rather than collecting it and failing on the driver side?

Re: What should be spark.local.dir in spark on yarn?

2016-03-01 Thread Alexander Pivovarov
Spark 1.6.0 uses /tmp in the following places when spark.local.dir is not set (yarn.nodemanager.local-dirs=/data01/yarn/nm,/data02/yarn/nm):
1. spark-shell on start:
   16/03/01 08:33:48 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-ffd3143d-b47f-4844-99fd-2d51c6a05d05
2.

Re: Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang
I may not have expressed it clearly. This method tries to create the virtualenv before the Python worker starts, and the virtualenv is application-scoped: after the Spark application finishes, the virtualenv is cleaned up. And the virtualenvs don't need to be at the same path on each node (in my POC, it is
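
A loose sketch of the setup step being described, assuming the executor shells out to virtualenv and pip before launching its Python worker; the helper name, paths, and flow here are hypothetical, not the actual POC code:

    import java.io.File
    import scala.sys.process._

    object VirtualenvSetupSketch {
      // Hypothetical application-scoped setup: create a virtualenv in a
      // per-application scratch directory, install the user's requirements,
      // and hand back the interpreter path for the Python worker. The
      // directory lives and dies with the application.
      def setupVirtualenv(appId: String, requirementsFile: String): String = {
        val envDir = new File(s"/tmp/$appId/virtualenv") // hypothetical location
        if (!envDir.exists()) {
          Seq("virtualenv", envDir.getAbsolutePath).!
          Seq(s"${envDir.getAbsolutePath}/bin/pip",
              "install", "-r", requirementsFile).!
        }
        s"${envDir.getAbsolutePath}/bin/python"
      }
    }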