[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711

2014-11-23 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222545#comment-14222545 ] Matei Zaharia commented on SPARK-3633: -- [~stephen] you can try the 1.1.1 RC in http

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-23 Thread Matei Zaharia
Interesting, perhaps we could publish each one with two IDs, of which the rc one is unofficial. The problem is indeed that you have to vote on a hash for a potentially final artifact. Matei On Nov 23, 2014, at 7:54 PM, Stephen Haberman stephen.haber...@gmail.com wrote: Hi, I wanted to

Re: [ANNOUNCE] Spark 1.2.0 Release Preview Posted

2014-11-20 Thread Matei Zaharia
You can still send patches for docs until the release goes out -- please do if you see stuff. Matei On Nov 20, 2014, at 6:39 AM, Madhu ma...@madhu.com wrote: Thanks Patrick. I've been testing some 1.2 features, looks good so far. I have some example code that I think will be helpful for

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-20 Thread Matei Zaharia
-rc2/ http://people.apache.org/~andrewor14/spark-1.1.1-rc2/ On Thu, Nov 20, 2014 at 11:48 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hector, is this a comment on 1.1.1 or on the 1.2 preview? Matei On Nov 20, 2014, at 11:39 AM, Hector Yee hector

Re: rack-topology.sh no such file or directory

2014-11-19 Thread Matei Zaharia
Your Hadoop configuration is set to look for this file to determine racks. Is the file present on cluster nodes? If not, look at your hdfs-site.xml and remove the setting for a rack topology script there (or it might be in core-site.xml). Matei On Nov 19, 2014, at 12:13 PM, Arun Luthra
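As a rough aid for the check described above, the sketch below scans the text of a Hadoop configuration file for a rack-topology script setting. It assumes the property is named `net.topology.script.file.name` (Hadoop 2.x) or `topology.script.file.name` (Hadoop 1.x); the helper name `find_topology_script` and the sample script path are made up for illustration and do not come from the thread.

```python
import xml.etree.ElementTree as ET

# Property names assumed here: the Hadoop 2.x and Hadoop 1.x spellings.
TOPOLOGY_KEYS = ("net.topology.script.file.name", "topology.script.file.name")


def find_topology_script(xml_text):
    """Return (name, value) pairs for topology-script properties in a Hadoop config."""
    root = ET.fromstring(xml_text)
    found = []
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in TOPOLOGY_KEYS:
            found.append((name, (prop.findtext("value") or "").strip()))
    return found


# Hypothetical config content; the script path is illustrative only.
sample = """<configuration>
  <property>
    <name>net.topology.script.file.name</name>
    <value>/etc/hadoop/conf/rack-topology.sh</value>
  </property>
</configuration>"""

print(find_topology_script(sample))
```

Running the same function over the contents of hdfs-site.xml and core-site.xml should show which file, if either, carries the setting to remove.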

[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-18 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216691#comment-14216691 ] Matei Zaharia commented on SPARK-4452: -- BTW I've thought about this more and here's

[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-18 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217331#comment-14217331 ] Matei Zaharia commented on SPARK-4452: -- Forced spilling is orthogonal to how you set

[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-17 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215425#comment-14215425 ] Matei Zaharia commented on SPARK-4452: -- How much of this gets fixed if you fix

[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-17 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215557#comment-14215557 ] Matei Zaharia commented on SPARK-4452: -- BTW we may also want to create a separate

[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-17 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215556#comment-14215556 ] Matei Zaharia commented on SPARK-4452: -- Got it. It would be fine to do this if you

[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-16 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4306: - Target Version/s: 1.2.0 LogisticRegressionWithLBFGS support for PySpark MLlib

[jira] [Commented] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-16 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214134#comment-14214134 ] Matei Zaharia commented on SPARK-4306: -- [~srinathsmn] I've assigned it to you. When

[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-16 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4306: - Assignee: Varadharajan LogisticRegressionWithLBFGS support for PySpark MLlib

[jira] [Created] (SPARK-4435) Add setThreshold in Python LogisticRegressionModel and SVMModel

2014-11-16 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4435: Summary: Add setThreshold in Python LogisticRegressionModel and SVMModel Key: SPARK-4435 URL: https://issues.apache.org/jira/browse/SPARK-4435 Project: Spark

[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1

2014-11-16 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214155#comment-14214155 ] Matei Zaharia commented on SPARK-4434: -- [~joshrosen] make sure to revert this on 1.2

[jira] [Created] (SPARK-4439) Export RandomForest in Python

2014-11-16 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4439: Summary: Export RandomForest in Python Key: SPARK-4439 URL: https://issues.apache.org/jira/browse/SPARK-4439 Project: Spark Issue Type: New Feature

[jira] [Updated] (SPARK-4439) Expose RandomForest in Python

2014-11-16 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4439: - Summary: Expose RandomForest in Python (was: Export RandomForest in Python) Expose RandomForest

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-14 Thread Matei Zaharia
+1 Tested on Mac OS X, and verified that sort-based shuffle bug is fixed. Matei On Nov 14, 2014, at 10:45 AM, Andrew Or and...@databricks.com wrote: Hi all, since the vote ends on a Sunday, please let me know if you would like to extend the deadline to allow more time for testing.

[jira] [Resolved] (SPARK-4330) Link to proper URL for YARN overview

2014-11-10 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-4330. -- Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Target

[jira] [Updated] (SPARK-4330) Link to proper URL for YARN overview

2014-11-10 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4330: - Assignee: Kousuke Saruta Link to proper URL for YARN overview

Re: Kafka version dependency in Spark 1.2

2014-11-10 Thread Matei Zaharia
Just curious, what are the pros and cons of this? Can the 0.8.1.1 client still talk to 0.8.0 versions of Kafka, or do you need it to match your Kafka version exactly? Matei On Nov 10, 2014, at 9:48 AM, Bhaskar Dutta bhas...@gmail.com wrote: Hi, Is there any plan to bump the Kafka

Re: closure serialization behavior driving me crazy

2014-11-10 Thread Matei Zaharia
Hey Sandy, Try using the -Dsun.io.serialization.extendedDebugInfo=true flag on the JVM to print the contents of the objects. In addition, something else that helps is to do the following: { val _arr = arr; models.map(... _arr ...) } Basically, copy the global variable into a local one.

Re: Why does this siimple spark program uses only one core?

2014-11-09 Thread Matei Zaharia
Call getNumPartitions() on your RDD to make sure it has the right number of partitions. You can also specify it when doing parallelize, e.g. rdd = sc.parallelize(xrange(1000), 10). This should run in parallel if you have multiple partitions and cores, but it might be that during part of the
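The partition check suggested above can be wrapped in a small helper, sketched here under a few assumptions: `sc` is an already-running PySpark SparkContext, the helper name `partitions_report` is hypothetical, and `range` stands in for the Python 2 `xrange` used in the message.

```python
def partitions_report(sc, n_elements=1000, n_slices=10):
    """Create an RDD with an explicit slice count and report its partitioning.

    `sc` is assumed to be a live SparkContext; nothing is computed on the
    cluster here, only the RDD's metadata is inspected.
    """
    rdd = sc.parallelize(range(n_elements), n_slices)
    return {
        "requested": n_slices,
        "actual": rdd.getNumPartitions(),
        "default_parallelism": sc.defaultParallelism,
    }
```

In a pyspark shell, `partitions_report(sc)` would show whether the RDD really has the requested number of partitions; an "actual" value of 1 would be consistent with only one core doing work.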

[RESULT] [VOTE] Designating maintainers for some Spark components

2014-11-08 Thread Matei Zaharia
is just to have a better structure for reviewing and minimize the chance of errors. Here is a tally of the votes: Binding votes (from PMC): 17 +1, no 0 or -1 Matei Zaharia Michael Armbrust Reynold Xin Patrick Wendell Andrew Or Prashant Sharma Mark Hamstra Xiangrui Meng Ankur Dave Imran Rashid Jason

Re: wierd caching

2014-11-08 Thread Matei Zaharia
It might mean that some partition was computed on two nodes, because a task for it wasn't able to be scheduled locally on the first node. Did the RDD really have 426 partitions total? You can click on it and see where there are copies of each one. Matei On Nov 8, 2014, at 10:16 PM, Nathan

[jira] [Commented] (SPARK-4303) [MLLIB] Use Long IDs instead of Int in ALS.Rating class

2014-11-07 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203147#comment-14203147 ] Matei Zaharia commented on SPARK-4303: -- Yup, this will actually become easier

[jira] [Resolved] (SPARK-4186) Support binaryFiles and binaryRecords API in Python

2014-11-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-4186. -- Resolution: Fixed Fix Version/s: 1.2.0 Support binaryFiles and binaryRecords API

[jira] [Resolved] (SPARK-644) Jobs canceled due to repeated executor failures may hang

2014-11-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-644. - Resolution: Fixed Jobs canceled due to repeated executor failures may hang

[jira] [Resolved] (SPARK-643) Standalone master crashes during actor restart

2014-11-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-643. - Resolution: Fixed Standalone master crashes during actor restart

[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem

2014-11-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200514#comment-14200514 ] Matei Zaharia commented on SPARK-677: - [~joshrosen] is this fixed now? PySpark should

[jira] [Resolved] (SPARK-681) Optimize hashtables used in Spark

2014-11-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-681. - Resolution: Fixed Optimize hashtables used in Spark

[jira] [Resolved] (SPARK-993) Don't reuse Writable objects in HadoopRDDs by default

2014-11-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-993. - Resolution: Won't Fix We investigated this for 1.0 but found that many InputFormats behave wrongly

[jira] [Commented] (SPARK-993) Don't reuse Writable objects in HadoopRDDs by default

2014-11-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200531#comment-14200531 ] Matei Zaharia commented on SPARK-993: - Arun, you'd see this issue if you do collect

[jira] [Closed] (SPARK-1000) Crash when running SparkPi example with local-cluster

2014-11-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-1000. Resolution: Cannot Reproduce Crash when running SparkPi example with local-cluster

[jira] [Resolved] (SPARK-1023) Remove Thread.sleep(5000) from TaskSchedulerImpl

2014-11-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1023. -- Resolution: Fixed Remove Thread.sleep(5000) from TaskSchedulerImpl

[jira] [Resolved] (SPARK-1185) In Spark Programming Guide, Master URLs should mention yarn-client

2014-11-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1185. -- Resolution: Fixed In Spark Programming Guide, Master URLs should mention yarn-client

[jira] [Closed] (SPARK-2237) Add ZLIBCompressionCodec code

2014-11-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-2237. Resolution: Won't Fix Add ZLIBCompressionCodec code

[jira] [Updated] (SPARK-2348) In Windows having a enviorinment variable named 'classpath' gives error

2014-11-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2348: - Priority: Critical (was: Major) In Windows having a enviorinment variable named 'classpath

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Matei Zaharia
traffic, and be very active in design API discussions. That leads to better consistency and long-term design choices. Cheers, bc On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Matei Zaharia
On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Matei Zaharia
Alright, Greg, I think I understand how Subversion's model is different, which is that the PMC members are all full committers. However, I still think that the model proposed here is purely organizational (how the PMC and committers organize themselves), and in no way changes peoples' ownership

[jira] [Updated] (SPARK-4222) FixedLengthBinaryRecordReader should readFully

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4222: - Assignee: Jascha Swisher FixedLengthBinaryRecordReader should readFully

[jira] [Resolved] (SPARK-4222) FixedLengthBinaryRecordReader should readFully

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-4222. -- Resolution: Fixed Fix Version/s: 1.2.0 FixedLengthBinaryRecordReader should readFully

[jira] [Updated] (SPARK-4040) Update spark documentation for local mode and spark-streaming.

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4040: - Assignee: jay vyas Update spark documentation for local mode and spark-streaming

[jira] [Resolved] (SPARK-4040) Update spark documentation for local mode and spark-streaming.

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-4040. -- Resolution: Fixed Update spark documentation for local mode and spark-streaming

[jira] [Resolved] (SPARK-565) Integrate spark in scala standard collection API

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-565. - Resolution: Won't Fix FYI I'm going to close this because we've locked down the API for 1.X

[jira] [Closed] (SPARK-542) Cache Miss when machine have multiple hostname

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-542. --- Resolution: Won't Fix New versions of Spark have ways to specify the hostname and IP address to bind

[jira] [Resolved] (SPARK-600) SparkContext.stop and clearJars delete local JAR files

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-600. - Resolution: Fixed Should no longer be a problem since 1.0 SparkContext.stop and clearJars delete

[jira] [Resolved] (SPARK-619) Hadoop MapReduce should be configured to use all local disks for shuffle on AMI

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-619. - Resolution: Fixed Hadoop MapReduce should be configured to use all local disks for shuffle

[jira] [Resolved] (SPARK-656) Let Amazon choose our EC2 clusters' availability zone if the user does not specify one

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-656. - Resolution: Fixed Let Amazon choose our EC2 clusters' availability zone if the user does

[jira] [Resolved] (SPARK-610) Support master failover in standalone mode

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-610. - Resolution: Fixed Fix Version/s: 0.8.1 Assignee: Aaron Davidson Support master

[jira] [Commented] (SPARK-785) ClosureCleaner not invoked on most PairRDDFunctions

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199922#comment-14199922 ] Matei Zaharia commented on SPARK-785: - [~adav] it still seems to be, weirdly enough

[jira] [Resolved] (SPARK-812) Netty shuffle creates a lot of open file handles

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-812. - Resolution: Invalid No longer a problem for new versions of the Netty shuffle Netty shuffle

[jira] [Resolved] (SPARK-880) When built with Hadoop2, spark-shell and examples don't initialize log4j properly

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-880. - Resolution: Fixed When built with Hadoop2, spark-shell and examples don't initialize log4j

[jira] [Resolved] (SPARK-824) Make less copies of blocks during remote reads

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-824. - Resolution: Fixed This is a pretty old issue that no longer affects the newest block manager

[jira] [Resolved] (SPARK-914) Make RDD implement Scala and Java Iterable interfaces

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-914. - Resolution: Fixed Fix Version/s: 1.0.0 Make RDD implement Scala and Java Iterable

[jira] [Resolved] (SPARK-1063) Add .sortBy(f) method on RDD

2014-11-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1063. -- Resolution: Fixed Add .sortBy(f) method on RDD

Re: Breaking the previous large-scale sort record with Spark

2014-11-05 Thread Matei Zaharia
this happen. Updated blog post: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi folks, I interrupt your regularly scheduled user / dev list to bring you

[VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Matei Zaharia
Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Matei Zaharia
need a maintainer for Mesos, and I wonder if there is someone that can be added to that? Tim On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote

Re: Surprising Spark SQL benchmark

2014-11-05 Thread Matei Zaharia
Yup, the Hadoop nodes were from 2013, each with 64 GB RAM, 12 cores, 10 Gbps Ethernet and 12 disks. For 100 TB of data, the intermediate data could fit in memory on this cluster, which can make shuffle much faster than with intermediate data on SSDs. You can find the specs in

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Matei Zaharia
, 2014 at 1:31 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still

Re: Any Replicated RDD in Spark?

2014-11-05 Thread Matei Zaharia
for me to do that? Collect RDD in driver first and create broadcast? Or any shortcut in spark for this? Thanks! -Original Message- From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Wednesday, November 05, 2014 3:32 PM To: 'Matei Zaharia' Cc: 'user@spark.apache.org' Subject

Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
Is this about Spark SQL vs Redshift, or Spark in general? Spark in general provides a broader set of capabilities than Redshift because it has APIs in general-purpose languages (Java, Scala, Python) and libraries for things like machine learning and graph processing. For example, you might use

Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
exported from Redshift into Spark or Hadoop. Matei On Nov 4, 2014, at 3:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Is this about Spark SQL vs Redshift, or Spark in general? Spark in general provides a broader set of capabilities than Redshift because it has APIs in general-purpose

Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer files. So I'd suggest trying that too. Matei On Nov 3, 2014, at 6:12 PM, Andrew Or and...@databricks.com wrote: Hey Matt, There's some prior work that compares

Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
(BTW this had a bug with negative hash codes in 1.1.0 so you should try branch-1.1 for it). Matei On Nov 3, 2014, at 6:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote: In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer

Re: Any Replicated RDD in Spark?

2014-11-03 Thread Matei Zaharia
You need to use broadcast followed by flatMap or mapPartitions to do map-side joins (in your map function, you can look at the hash table you broadcast and see what records match it). Spark SQL also does it by default for tables smaller than the spark.sql.autoBroadcastJoinThreshold setting (by
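The map-side join described above can be sketched in plain Python, without Spark, to show only the per-partition lookup logic. The function name `map_side_join` and the sample data are hypothetical; the sketch assumes inner-join semantics and a small table that fits in memory as a dict of lists.

```python
def map_side_join(partition, lookup):
    """Join one partition of (key, value) records against a small lookup table.

    `lookup` maps key -> list of values from the small table. Records whose
    key is absent from `lookup` produce no output (inner-join semantics).
    """
    for key, value in partition:
        for small_value in lookup.get(key, ()):
            yield key, (value, small_value)


# Illustrative data: the small table as a hash table, one partition of the big one.
small_table = {"a": [1], "b": [2, 3]}
big_partition = [("a", "x"), ("b", "y"), ("c", "z")]

joined = list(map_side_join(big_partition, small_table))
print(joined)
```

In a PySpark job the same function could be used roughly as `bc = sc.broadcast(small_table)` followed by `big_rdd.mapPartitions(lambda part: map_side_join(part, bc.value))`, where `big_rdd` is assumed to be an RDD of key/value pairs.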

[jira] [Resolved] (SPARK-3466) Limit size of results that a driver collects for each action

2014-11-02 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3466. -- Resolution: Fixed Fix Version/s: 1.2.0 Limit size of results that a driver collects

[jira] [Resolved] (SPARK-2759) The ability to read binary files into Spark

2014-11-01 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2759. -- Resolution: Fixed Fix Version/s: 1.2.0 The ability to read binary files into Spark

[jira] [Created] (SPARK-4186) Support binaryFiles and binaryRecords API in Python

2014-11-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4186: Summary: Support binaryFiles and binaryRecords API in Python Key: SPARK-4186 URL: https://issues.apache.org/jira/browse/SPARK-4186 Project: Spark Issue Type

[jira] [Commented] (SPARK-4186) Support binaryFiles and binaryRecords API in Python

2014-11-01 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193363#comment-14193363 ] Matei Zaharia commented on SPARK-4186: -- [~davies] it would be great if you have

[jira] [Resolved] (SPARK-3932) Support reading fixed-precision decimals from Hive 0.13

2014-11-01 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3932. -- Resolution: Fixed Fix Version/s: 1.2.0 Done in https://github.com/apache/spark/pull/2983

[jira] [Commented] (SPARK-3931) Support reading fixed-precision decimals from Parquet

2014-11-01 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193666#comment-14193666 ] Matei Zaharia commented on SPARK-3931: -- Done in https://github.com/apache/spark/pull

[jira] [Resolved] (SPARK-3931) Support reading fixed-precision decimals from Parquet

2014-11-01 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3931. -- Resolution: Fixed Fix Version/s: 1.2.0 Support reading fixed-precision decimals from

[jira] [Resolved] (SPARK-3929) Support for fixed-precision decimal

2014-11-01 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3929. -- Resolution: Fixed Fix Version/s: 1.2.0 Support for fixed-precision decimal

Re: union of SchemaRDDs

2014-11-01 Thread Matei Zaharia
Try unionAll, which is a special method on SchemaRDDs that keeps the schema on the results. Matei On Nov 1, 2014, at 3:57 PM, Daniel Mahler dmah...@gmail.com wrote: I would like to combine 2 parquet tables I have create. I tried: sc.union(sqx.parquetFile(fileA),

Re: union of SchemaRDDs

2014-11-01 Thread Matei Zaharia
Matei. What does unionAll do if the input RDD schemas are not 100% compatible. Does it take the union of the columns and generalize the types? thanks Daniel On Sat, Nov 1, 2014 at 6:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Try unionAll, which

[jira] [Updated] (SPARK-3561) Allow for pluggable execution contexts in Spark

2014-10-31 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3561: - Fix Version/s: (was: 1.2.0) Allow for pluggable execution contexts in Spark

[jira] [Created] (SPARK-4176) Support decimals with precision 18 in Parquet

2014-10-31 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4176: Summary: Support decimals with precision 18 in Parquet Key: SPARK-4176 URL: https://issues.apache.org/jira/browse/SPARK-4176 Project: Spark Issue Type: New

[jira] [Resolved] (SPARK-1847) Pushdown filters on non-required parquet columns

2014-10-31 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1847. -- Resolution: Fixed Fix Version/s: 1.2.0 Pushdown filters on non-required parquet columns

[jira] [Updated] (SPARK-3968) Use parquet-mr filter2 api in spark sql

2014-10-31 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3968: - Assignee: Yash Datta Use parquet-mr filter2 api in spark sql

[jira] [Updated] (SPARK-1847) Pushdown filters on non-required parquet columns

2014-10-31 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1847: - Assignee: Yash Datta Pushdown filters on non-required parquet columns

Re: SparkContext.stop() ?

2014-10-31 Thread Matei Zaharia
You don't have to call it if you just exit your application, but it's useful for example in unit tests if you want to create and shut down a separate SparkContext for each test. Matei On Oct 31, 2014, at 10:39 AM, Evan R. Sparks evan.spa...@gmail.com wrote: In cluster settings if you don't
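The unit-test pattern mentioned above might look like the following sketch, assuming PySpark and the standard `unittest` module. The class name `SparkContextTestCase` and the overridable `create_context` hook are hypothetical; the local master string `local[2]` is an arbitrary choice.

```python
import unittest


class SparkContextTestCase(unittest.TestCase):
    """Base class that gives every test method its own SparkContext."""

    def create_context(self):
        # Imported lazily so this module loads even where PySpark is absent.
        from pyspark import SparkContext
        return SparkContext("local[2]", type(self).__name__)

    def setUp(self):
        self.sc = self.create_context()

    def tearDown(self):
        # Stop the context so the next test can create a fresh one.
        self.sc.stop()
```

A concrete test class would subclass this and use `self.sc` in its test methods; stopping the context in `tearDown` is what lets the next `setUp` create a fresh one.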

Re: Confused about class paths in spark 1.1.0

2014-10-30 Thread Matei Zaharia
Try using --jars instead of the driver-only options; they should work with spark-shell too but they may be less tested. Unfortunately, you do have to specify each JAR separately; you can maybe use a shell script to list a directory and get a big list, or set up a project that builds all of the
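The message suggests a shell script for listing a directory; the same idea is sketched here in Python instead. It assumes that `--jars` accepts a comma-separated list of paths; the helper name `jars_argument` is hypothetical.

```python
import os


def jars_argument(directory):
    """Return a comma-separated list of the .jar files directly inside `directory`."""
    names = sorted(n for n in os.listdir(directory) if n.endswith(".jar"))
    return ",".join(os.path.join(directory, n) for n in names)
```

The returned string can be passed as the value of `--jars` when launching spark-shell or spark-submit. Subdirectories are not searched, and non-JAR files are ignored.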

Re: Confused about class paths in spark 1.1.0

2014-10-30 Thread Matei Zaharia
to spark-shell. Correct? If so I will file a bug report since this is definitely not the case. On Thu, Oct 30, 2014 at 5:39 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Try using --jars instead of the driver-only options; they should work with spark-shell

Re: BUG: when running as extends App, closures don't capture variables

2014-10-29 Thread Matei Zaharia
Good catch! If you'd like, you can send a pull request changing the files in docs/ to do this (see https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark), otherwise maybe open an issue on

[jira] [Updated] (SPARK-3466) Limit size of results that a driver collects for each action

2014-10-28 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3466: - Priority: Critical (was: Major) Limit size of results that a driver collects for each action

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Matei Zaharia
Hi Stephen, How did you generate your Maven workspace? You need to make sure the Hive profile is enabled for it. For example sbt/sbt -Phive gen-idea. Matei On Oct 28, 2014, at 7:42 PM, Stephen Boesch java...@gmail.com wrote: I have run on the command line via maven and it is fine: mvn

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
A pretty large fraction of users use Java, but a few features are still not available in it. JdbcRDD is one of them -- this functionality will likely be superseded by Spark SQL when we add JDBC as a data source. In the meantime, to use it, I'd recommend writing a class in Scala that has

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
The overridable methods of RDD are marked as @DeveloperApi, which means that these are internal APIs used by people that might want to extend Spark, but are not guaranteed to remain stable across Spark versions (unlike Spark's public APIs). BTW, if you want a way to do this that does not

[jira] [Commented] (SPARK-3466) Limit size of results that a driver collects for each action

2014-10-21 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178025#comment-14178025 ] Matei Zaharia commented on SPARK-3466: -- Ah, I see, that concern makes sense

Re: Primitive arrays in Spark

2014-10-21 Thread Matei Zaharia
It seems that ++ does the right thing on arrays of longs, and gives you another one: scala> val a = Array[Long](1,2,3) a: Array[Long] = Array(1, 2, 3) scala> val b = Array[Long](1,2,3) b: Array[Long] = Array(1, 2, 3) scala> a ++ b res0: Array[Long] = Array(1, 2, 3, 1, 2, 3) scala> res0.getClass
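The same check can be run outside the REPL. The sketch below uses only the Scala standard library; the `PrimitiveArrayConcat` object and `concat` name are illustrative and not from the thread.

```scala
object PrimitiveArrayConcat {
  // Concatenates two Array[Long] values with ++; the static result type
  // is again Array[Long].
  def concat(a: Array[Long], b: Array[Long]): Array[Long] = a ++ b

  def main(args: Array[String]): Unit = {
    val c = concat(Array[Long](1, 2, 3), Array[Long](1, 2, 3))
    // classOf[Long] is the primitive `long` class, so this checks that the
    // runtime array holds unboxed longs rather than java.lang.Long objects.
    assert(c.getClass.getComponentType == classOf[Long])
    println(c.mkString(", "))
  }
}
```

Comparing the component type against `classOf[Long]` is a quick way to confirm that no boxing crept in.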

[jira] [Updated] (SPARK-3467) Python BatchedSerializer should dynamically lower batch size for large objects

2014-10-20 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3467: - Assignee: Davies Liu Python BatchedSerializer should dynamically lower batch size for large

[jira] [Commented] (SPARK-3655) Secondary sort

2014-10-20 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177824#comment-14177824 ] Matei Zaharia commented on SPARK-3655: -- I believe you can build this on top

Re: Submissions open for Spark Summit East 2015

2014-10-19 Thread Matei Zaharia
BTW several people asked about registration and student passes. Registration will open in a few weeks, and like in previous Spark Summits, I expect there to be a special pass for students. Matei On Oct 18, 2014, at 9:52 PM, Matei Zaharia matei.zaha...@gmail.com wrote: After successful

Re: Raise Java dependency from 6 to 7

2014-10-18 Thread Matei Zaharia
I'd also wait a bit until these are gone. Jetty is unfortunately a much hairier topic by the way, because the Hadoop libraries also depend on Jetty. I think it will be hard to update. However, a patch that shades Jetty might be nice to have, if that doesn't require shading a lot of other stuff.

Submissions open for Spark Summit East 2015

2014-10-18 Thread Matei Zaharia
After successful events in the past two years, the Spark Summit conference has expanded for 2015, offering both an event in New York on March 18-19 and one in San Francisco on June 15-17. The conference is a great chance to meet people from throughout the Spark community and see the latest
