Anyone else having trouble with replicated off heap RDD persistence?

2016-08-16 Thread Michael Allman
Hello, A coworker was having a problem with a big Spark job failing after several hours when one of the executors would segfault. That problem aside, I speculated that her job would be more robust against these kinds of executor crashes if she used replicated RDD storage. She's using off heap
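
For context, a minimal sketch (assuming Spark 2.x and that sc is a SparkContext, e.g. in spark-shell) of how a replicated off-heap storage level can be constructed; the input path and replication factor are hypothetical, and this is exactly the configuration the thread reports trouble with:

    import org.apache.spark.storage.StorageLevel

    // StorageLevel.OFF_HEAP is defined with replication = 1; a 2x-replicated
    // off-heap level can be constructed explicitly. Off-heap storage also
    // requires spark.memory.offHeap.enabled=true and a nonzero
    // spark.memory.offHeap.size in the Spark configuration.
    val OFF_HEAP_2 = StorageLevel(
      useDisk = true, useMemory = true, useOffHeap = true,
      deserialized = false, replication = 2)

    val rdd = sc.textFile("hdfs:///path/to/input")  // hypothetical input path
    rdd.persist(OFF_HEAP_2)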

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Michael Allman
FWIW, this is an essential feature to our use of Spark, and I'm surprised it's not advertised clearly as a limitation in the documentation. All I've found about running Spark 1.3 on 2.11 is here: http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211 Also, I'm experiencing

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Michael Allman
not not-ready; it's just not the Scala 2.11.6 REPL. Still, sure I'd favor breaking the unofficial support to at least make the latest Scala 2.11 the unbroken one. On Fri, Apr 17, 2015 at 7:58 AM, Michael Allman mich...@videoamp.com wrote: FWIW, this is an essential feature to our use of Spark

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Michael Allman
at 10:31 PM, Michael Allman mich...@videoamp.com wrote: H... I don't follow. The 2.11.x series is supposed to be binary compatible against user code. Anyway, I was building Spark against 2.11.2 and still saw the problems with the REPL. I've created a bug report: https://issues.apache.org/jira

independent user sessions with a multi-user spark sql thriftserver (Spark 1.1)

2014-11-17 Thread Michael Allman
Hello, We're running a Spark SQL Thrift server that several users connect to with beeline. One limitation we've run into is that the current working database (set with "USE db") is shared across all connections. So changing the database on one connection changes the database for all connections.
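
One workaround sketch, assuming the queries can be rewritten: qualify each table with its database instead of relying on the connection-level "USE db". The database and table names below are hypothetical, and the idea is shown through a HiveContext rather than beeline:

    import org.apache.spark.sql.hive.HiveContext

    // Qualifying every table with its database avoids depending on the
    // shared current-database state. Names are hypothetical.
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SELECT COUNT(*) FROM reporting_db.events").collect()
    hiveContext.sql("SELECT COUNT(*) FROM staging_db.events").collect()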

Re: [SQL] Set Parquet block size?

2014-10-09 Thread Michael Allman
Hi Pierre, I'm setting the Parquet (and HDFS) block size as follows: val ONE_GB = 1024 * 1024 * 1024 sc.hadoopConfiguration.setInt("dfs.blocksize", ONE_GB) sc.hadoopConfiguration.setInt("parquet.block.size", ONE_GB) Here, sc is a reference to the Spark context. I've tested this and it
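
To round this out, a minimal end-to-end sketch under the Spark 1.x SQL API (e.g. in spark-shell); the record type, row count, and output path are made up for illustration:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    // Hypothetical schema and data; saveAsParquetFile picks up the block
    // sizes set on sc.hadoopConfiguration above.
    case class Record(id: Int, value: String)
    val data = sc.parallelize(1 to 1000000).map(i => Record(i, s"row-$i"))
    data.saveAsParquetFile("hdfs:///tmp/records.parquet")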

Re: window every n elements instead of time based

2014-10-08 Thread Michael Allman
be that it breaks the concept of window operations which are in Spark. Thanks, Jayant On Tue, Oct 7, 2014 at 10:19 PM, Michael Allman [hidden email] wrote: Hi Andrew, The use case I have in mind is batch data serialization to HDFS, where sizing files to a certain HDFS block size
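
Since Spark Streaming has no built-in count-based window, here is a rough sketch of the chunking idea on a plain RDD; the chunk size and input path are hypothetical:

    import org.apache.spark.SparkContext._

    // Group records into fixed-size chunks by index; each chunk can then be
    // serialized to HDFS at roughly the desired size. Values are hypothetical.
    val chunkSize = 100000L
    val records = sc.textFile("hdfs:///path/to/input")
    val chunks = records
      .zipWithIndex()
      .map { case (rec, idx) => (idx / chunkSize, rec) }
      .groupByKey()
    chunks.foreach { case (chunkId, recs) =>
      println(s"chunk $chunkId holds ${recs.size} records")
    }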

Re: Interactive interface tool for spark

2014-10-08 Thread Michael Allman
Hi Andy, This sounds awesome. Please keep us posted. Meanwhile, can you share a link to your project? I wasn't able to find it. Cheers, Michael On Oct 8, 2014, at 3:38 AM, andy petrella andy.petre...@gmail.com wrote: Heya You can check Zeppelin or my fork of the Scala notebook. I'm

Re: Interactive interface tool for spark

2014-10-08 Thread Michael Allman
Ummm... what's helium? Link, plz? On Oct 8, 2014, at 9:13 AM, Stephen Boesch java...@gmail.com wrote: @kevin, Michael, Second that: interested in seeing the zeppelin. pls use helium though .. 2014-10-08 7:57 GMT-07:00 Michael Allman mich...@videoamp.com: Hi Andy, This sounds awesome

Re: Support for Parquet V2 in ParquetTableSupport?

2014-10-08 Thread Michael Allman
are hoping to do some upgrades of our parquet support in the near future. On Tue, Oct 7, 2014 at 10:33 PM, Michael Allman mich...@videoamp.com wrote: Hello, I was interested in testing Parquet V2 with Spark SQL, but noticed after some investigation that the parquet writer that Spark SQL uses

Support for Parquet V2 in ParquetTableSupport?

2014-10-07 Thread Michael Allman
Hello, I was interested in testing Parquet V2 with Spark SQL, but noticed after some investigation that the parquet writer that Spark SQL uses is fixed at V1 here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L350.
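
For reference, the standard parquet-mr switch for the V2 writer is the parquet.writer.version property. A minimal sketch follows; note that, per this thread, Spark SQL's ParquetTableSupport hard-coded V1 at the time, so its own writer may ignore the setting:

    // Request the Parquet V2 writer via the parquet-mr property. As the thread
    // notes, Spark SQL's ParquetTableSupport hard-codes V1, so this may have
    // no effect on that code path.
    sc.hadoopConfiguration.set("parquet.writer.version", "PARQUET_2_0")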

Re: window every n elements instead of time based

2014-10-03 Thread Michael Allman
Hi, I also have a use for count-based windowing. I'd like to process data batches by size as opposed to time. Is this feature on the development roadmap? Is there a JIRA ticket for it? Thank you, Michael

Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Michael Allman
I just ran a runtime performance comparison between 0.9.0-incubating and your als branch. I saw a 1.5x improvement in performance.

Re: possible bug in Spark's ALS implementation...

2014-03-14 Thread Michael Allman
I've been thoroughly investigating this issue over the past couple of days and have discovered quite a bit. For one thing, there is definitely (at least) one issue/bug in the Spark implementation that leads to incorrect results for models generated with rank 1 or a large number of iterations. I

is spark.cleaner.ttl safe?

2014-03-11 Thread Michael Allman
Hello, I've been trying to run an iterative spark job that spills 1+ GB to disk per iteration on a system with limited disk space. I believe there's enough space if spark would clean up unused data from previous iterations, but as it stands the number of iterations I can run is limited by
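
For reference, a minimal sketch of turning the cleaner on; the TTL of 3600 seconds and the app name are hypothetical values:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.cleaner.ttl is specified in seconds. The open question in this
    // thread is whether data older than the TTL can be dropped while it is
    // still needed by later iterations.
    val conf = new SparkConf()
      .setAppName("iterative-job")
      .set("spark.cleaner.ttl", "3600")
    val sc = new SparkContext(conf)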

possible bug in Spark's ALS implementation...

2014-03-11 Thread Michael Allman
Hi, I'm implementing a recommender based on the algorithm described in http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms the basis for Spark's ALS implementation for data sets with implicit features. The data set I'm working with is proprietary and I cannot share it,
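
For context, a minimal sketch of calling MLlib's implicit-feedback ALS, the implementation the thread is about; the input path, parsing, and hyperparameter values are placeholders, not taken from the thread:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Hypothetical input: "user,item,confidence" lines.
    val ratings = sc.textFile("hdfs:///path/to/implicit_prefs.csv")
      .map(_.split(","))
      .map { case Array(user, item, strength) =>
        Rating(user.toInt, item.toInt, strength.toDouble)
      }

    // rank = 10, iterations = 15, lambda = 0.01, alpha = 40.0 (placeholders)
    val model = ALS.trainImplicit(ratings, 10, 15, 0.01, 40.0)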