Re: mllib.linalg.Vectors vs Breeze?

2014-10-18 Thread Matei Zaharia
toBreeze is private within Spark, it should not be accessible to users. If you want to make a Breeze vector from an MLlib one, it's pretty straightforward, and you can make your own utility function for it. Matei On Oct 17, 2014, at 5:09 PM, Sean Owen so...@cloudera.com wrote: Yes, I
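Such a utility is only a few lines of Scala. A sketch (assuming Spark 1.x MLlib and Breeze on the classpath; `VectorConversions` and this user-side `toBreeze` are our own names, not part of Spark's public API):

```scala
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

// A user-side helper that mirrors what MLlib does internally:
// wrap the underlying arrays in the corresponding Breeze types.
object VectorConversions {
  def toBreeze(v: Vector): BV[Double] = v match {
    case dv: DenseVector  => new BDV(dv.values)
    case sv: SparseVector => new BSV(sv.indices, sv.values, sv.size)
  }
}
```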

Submissions open for Spark Summit East 2015

2014-10-18 Thread Matei Zaharia
After successful events in the past two years, the Spark Summit conference has expanded for 2015, offering both an event in New York on March 18-19 and one in San Francisco on June 15-17. The conference is a great chance to meet people from throughout the Spark community and see the latest

[jira] [Created] (SPARK-3929) Support for fixed-precision decimal

2014-10-13 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3929: Summary: Support for fixed-precision decimal Key: SPARK-3929 URL: https://issues.apache.org/jira/browse/SPARK-3929 Project: Spark Issue Type: New Feature

[jira] [Created] (SPARK-3930) Add precision and scale to Spark SQL's Decimal type

2014-10-13 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3930: Summary: Add precision and scale to Spark SQL's Decimal type Key: SPARK-3930 URL: https://issues.apache.org/jira/browse/SPARK-3930 Project: Spark Issue Type

[jira] [Updated] (SPARK-3929) Support for fixed-precision decimal

2014-10-13 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3929: - Description: Spark SQL should support fixed-precision decimals, which are available in Hive 0.13

[jira] [Created] (SPARK-3931) Support reading fixed-precision decimals from Parquet

2014-10-13 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3931: Summary: Support reading fixed-precision decimals from Parquet Key: SPARK-3931 URL: https://issues.apache.org/jira/browse/SPARK-3931 Project: Spark Issue

[jira] [Created] (SPARK-3932) Support reading fixed-precision decimals from Hive 0.13

2014-10-13 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3932: Summary: Support reading fixed-precision decimals from Hive 0.13 Key: SPARK-3932 URL: https://issues.apache.org/jira/browse/SPARK-3932 Project: Spark Issue

[jira] [Created] (SPARK-3933) Optimize decimal type in Spark SQL for those with small precision

2014-10-13 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3933: Summary: Optimize decimal type in Spark SQL for those with small precision Key: SPARK-3933 URL: https://issues.apache.org/jira/browse/SPARK-3933 Project: Spark

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Matei Zaharia
of issues. Thanks in advance! On Oct 10, 2014 10:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which is that we've been able to use Spark to break MapReduce's 100 TB

Re: reading/writing parquet decimal type

2014-10-12 Thread Matei Zaharia
Hi Michael, I've been working on this in my repo: https://github.com/mateiz/spark/tree/decimal. I'll make some pull requests with these features soon, but meanwhile you can try this branch. See https://github.com/mateiz/spark/compare/decimal for the individual commits that went into it. It

Re: reading/writing parquet decimal type

2014-10-12 Thread Matei Zaharia
the values as a parquet binary type. Why not write them using the int64 parquet type instead? Cheers, Michael On Oct 12, 2014, at 3:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Michael, I've been working on this in my repo: https://github.com/mateiz/spark/tree/decimal. I'll make

Re: Blog post: An Absolutely Unofficial Way to Connect Tableau to SparkSQL (Spark 1.1)

2014-10-11 Thread Matei Zaharia
Very cool Denny, thanks for sharing this! Matei On Oct 11, 2014, at 9:46 AM, Denny Lee denny.g@gmail.com wrote: https://www.concur.com/blog/en-us/connect-tableau-to-sparksql If you're wondering how to connect Tableau to SparkSQL - here are the steps to connect Tableau to SparkSQL.

Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Matei Zaharia
Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which is that we've been able to use Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x fewer nodes. There's a detailed writeup at

Re: add Boulder-Denver Spark meetup to list on website

2014-10-10 Thread Matei Zaharia
Added you, thanks! (You may have to shift-refresh the page to see it updated). Matei On Oct 10, 2014, at 1:52 PM, Michael Oczkowski michael.oczkow...@seeq.com wrote: Please add the Boulder-Denver Spark meetup group to the list on the website.

Re: TorrentBroadcast slow performance

2014-10-09 Thread Matei Zaharia
Thanks for the feedback. For 1, there is an open patch: https://github.com/apache/spark/pull/2659. For 2, broadcast blocks actually use MEMORY_AND_DISK storage, so they will spill to disk if you have low memory, but they're faster to access otherwise. Matei On Oct 9, 2014, at 12:11 PM,

Re: TorrentBroadcast slow performance

2014-10-09 Thread Matei Zaharia
Oops I forgot to add, for 2, maybe we can add a flag to use DISK_ONLY for TorrentBroadcast, or if the broadcasts are bigger than some size. Matei On Oct 9, 2014, at 3:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Thanks for the feedback. For 1, there is an open patch: https

Re: Convert a org.apache.spark.sql.SchemaRDD[Row] to a RDD of Strings

2014-10-09 Thread Matei Zaharia
A SchemaRDD is still an RDD, so you can just do rdd.map(row => row.toString). Or if you want to get a particular field of the row, you can do rdd.map(row => row(3).toString). Matei On Oct 9, 2014, at 1:22 PM, Soumya Simanta soumya.sima...@gmail.com wrote: I've a SchemaRDD that I want to

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Matei Zaharia
I'm pretty sure inner joins on Spark SQL already build only one of the sides. Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer joins do both, and it seems like we could optimize it for those that are not full. Matei On Oct 7, 2014, at 11:04 PM, Haopu Wang

[jira] [Resolved] (SPARK-3762) clear all SparkEnv references after stop

2014-10-07 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3762. -- Resolution: Fixed Fix Version/s: 1.2.0 clear all SparkEnv references after stop

Re: TorrentBroadcast slow performance

2014-10-07 Thread Matei Zaharia
Maybe there is a firewall issue that makes it slow for your nodes to connect through the IP addresses they're configured with. I see there's this 10 second pause between Updated info of block broadcast_84_piece1 and ensureFreeSpace(4194304) called (where it actually receives the block). HTTP

Re: Spark SQL -- more than two tables for join

2014-10-07 Thread Matei Zaharia
The issue is that you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how you'd write this in HiveQL, and then try doing that with HiveContext.

[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711

2014-10-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161117#comment-14161117 ] Matei Zaharia commented on SPARK-3633: -- I'm curious, why do you think this is caused

[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711

2014-10-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161142#comment-14161142 ] Matei Zaharia commented on SPARK-3633: -- In that case though, the problem might

[jira] [Resolved] (SPARK-2530) Relax incorrect assumption of one ExternalAppendOnlyMap per thread

2014-10-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2530. -- Resolution: Fixed Fix Version/s: 1.1.0 This was fixed by SPARK-2711. Relax incorrect

Re: Impact of input format on timing

2014-10-05 Thread Matei Zaharia
Hi Tom, HDFS and Spark don't actually have a minimum block size -- so in that first dataset, the files won't each be costing you 64 MB. However, the main reason for difference in performance here is probably the number of RDD partitions. In the first case, Spark will create an RDD with 1

Re: Jython importing pyspark?

2014-10-05 Thread Matei Zaharia
PySpark doesn't attempt to support Jython at present. IMO while it might be a bit faster, it would lose a lot of the benefits of Python, which are the very strong data processing libraries (NumPy, SciPy, Pandas, etc). So I'm not sure it's worth supporting unless someone demonstrates a really

Re: run scalding on spark

2014-10-01 Thread Matei Zaharia
Pretty cool, thanks for sharing this! I've added a link to it on the wiki: https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects. Matei On Oct 1, 2014, at 1:41 PM, Koert Kuipers ko...@tresata.com wrote: well, sort of! we make input/output formats (cascading taps,

Re: Spark And Mapr

2014-10-01 Thread Matei Zaharia
It should just work in PySpark, the same way it does in Java / Scala apps. Matei On Oct 1, 2014, at 4:12 PM, Sungwook Yoon sy...@maprtech.com wrote: Yes.. you should use maprfs:// I personally haven't used pyspark, I just used scala shell or standalone with MapR. I think you need to

Re: Multiple spark shell sessions

2014-10-01 Thread Matei Zaharia
You need to set --total-executor-cores to limit how many total cores it grabs on the cluster. --executor-cores is just for each individual executor, but it will try to launch many of them. Matei On Oct 1, 2014, at 4:29 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID wrote: hey
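Concretely, on a standalone cluster the two flags might be combined like this (a sketch; the master URL and core counts are placeholders):

```shell
# Cap the total cores this shell takes across the whole cluster, so
# other sessions can still get resources. --executor-cores alone only
# bounds each individual executor; Spark would still launch as many
# executors as the cluster can hold.
spark-shell --master spark://master:7077 \
  --total-executor-cores 4 \
  --executor-cores 2
```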

[jira] [Resolved] (SPARK-3356) Document when RDD elements' ordering within partitions is nondeterministic

2014-09-30 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3356. -- Resolution: Fixed Fix Version/s: 1.2.0 Document when RDD elements' ordering within

[jira] [Updated] (SPARK-3356) Document when RDD elements' ordering within partitions is nondeterministic

2014-09-30 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3356: - Assignee: Sean Owen Document when RDD elements' ordering within partitions is nondeterministic

[jira] [Resolved] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort

2014-09-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3032. -- Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Potential bug when

[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort

2014-09-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152014#comment-14152014 ] Matei Zaharia commented on SPARK-3032: -- Yup, this will appear in 1.1.1. I've merged

[jira] [Resolved] (SPARK-3389) Add converter class to make reading Parquet files easy with PySpark

2014-09-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3389. -- Resolution: Fixed Fix Version/s: 1.2.0 Add converter class to make reading Parquet

[jira] [Updated] (SPARK-2745) Add Java friendly methods to Duration class

2014-09-23 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2745: - Assignee: Sean Owen (was: Sean Owen) Add Java friendly methods to Duration class

[jira] [Updated] (SPARK-2745) Add Java friendly methods to Duration class

2014-09-23 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2745: - Assignee: Sean Owen (was: Tathagata Das) Add Java friendly methods to Duration class

[jira] [Resolved] (SPARK-2745) Add Java friendly methods to Duration class

2014-09-23 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2745. -- Resolution: Fixed Fix Version/s: 1.2.0 Add Java friendly methods to Duration class

[jira] [Updated] (SPARK-3389) Add converter class to make reading Parquet files easy with PySpark

2014-09-23 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3389: - Target Version/s: 1.2.0 Add converter class to make reading Parquet files easy with PySpark

[jira] [Updated] (SPARK-3389) Add converter class to make reading Parquet files easy with PySpark

2014-09-23 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3389: - Assignee: Uri Laserson Add converter class to make reading Parquet files easy with PySpark

[jira] [Comment Edited] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-23 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145324#comment-14145324 ] Matei Zaharia edited comment on SPARK-3129 at 9/23/14 7:53 PM

[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-23 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145324#comment-14145324 ] Matei Zaharia commented on SPARK-3129: -- Is that 100 MB/s per node or in total

[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-23 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145537#comment-14145537 ] Matei Zaharia commented on SPARK-3129: -- Alright, in that case, this sounds pretty

Re: Spark Code to read RCFiles

2014-09-23 Thread Matei Zaharia
Is your file managed by Hive (and thus present in a Hive metastore)? In that case, Spark SQL (https://spark.apache.org/docs/latest/sql-programming-guide.html) is the easiest way. Matei On September 23, 2014 at 2:26:10 PM, Pramod Biligiri (pramodbilig...@gmail.com) wrote: Hi, I'm trying to

[jira] [Created] (SPARK-3643) Add cluster-specific config settings to configuration page

2014-09-22 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3643: Summary: Add cluster-specific config settings to configuration page Key: SPARK-3643 URL: https://issues.apache.org/jira/browse/SPARK-3643 Project: Spark

[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort

2014-09-22 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144244#comment-14144244 ] Matei Zaharia commented on SPARK-3032: -- Yeah actually I'm sure TimSort works fine

[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort

2014-09-22 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144243#comment-14144243 ] Matei Zaharia commented on SPARK-3032: -- I'm not completely sure that this is because

[jira] [Updated] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort

2014-09-22 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3032: - Assignee: Saisai Shao Potential bug when running sort-based shuffle with sorting using TimSort

Re: Possibly a dumb question: differences between saveAsNewAPIHadoopFile and saveAsNewAPIHadoopDataset?

2014-09-22 Thread Matei Zaharia
File takes a filename to write to, while Dataset takes only a JobConf. This means that Dataset is more general (it can also save to storage systems that are not file systems, such as key-value stores), but is more annoying to use if you actually have a file. Matei On September 21, 2014 at

[jira] [Created] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-21 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3628: Summary: Don't apply accumulator updates multiple times for tasks in result stages Key: SPARK-3628 URL: https://issues.apache.org/jira/browse/SPARK-3628 Project

[jira] [Commented] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-21 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142756#comment-14142756 ] Matei Zaharia commented on SPARK-3628: -- BTW the problem is that this used

[jira] [Comment Edited] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-21 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142756#comment-14142756 ] Matei Zaharia edited comment on SPARK-3628 at 9/21/14 10:43 PM

[jira] [Comment Edited] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-21 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142756#comment-14142756 ] Matei Zaharia edited comment on SPARK-3628 at 9/21/14 10:49 PM

[jira] [Updated] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-21 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3628: - Target Version/s: 1.1.1, 1.2.0, 1.0.3 (was: 1.1.1, 1.2.0, 0.9.3, 1.0.3) Don't apply accumulator

[jira] [Updated] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-21 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3628: - Target Version/s: 1.1.1, 1.2.0, 0.9.3, 1.0.3 (was: 1.1.1, 1.2.0, 1.0.3) Don't apply accumulator

[jira] [Updated] (SPARK-3629) Improvements to YARN doc

2014-09-21 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3629: - Description: Right now this doc starts off with a big list of config options, and only

Re: A couple questions about shared variables

2014-09-21 Thread Matei Zaharia
:10 AM, Matei Zaharia wrote: Hey Sandy, On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com) wrote: Hey All, A couple questions came up about shared variables recently, and I wanted to confirm my understanding and update the doc to be a little more clear. *Broadcast

[jira] [Created] (SPARK-3611) Show number of cores for each executor in application web UI

2014-09-20 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3611: Summary: Show number of cores for each executor in application web UI Key: SPARK-3611 URL: https://issues.apache.org/jira/browse/SPARK-3611 Project: Spark

[jira] [Created] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688

2014-09-20 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3619: Summary: Upgrade to Mesos 0.21 to work around MESOS-1688 Key: SPARK-3619 URL: https://issues.apache.org/jira/browse/SPARK-3619 Project: Spark Issue Type

Re: A couple questions about shared variables

2014-09-20 Thread Matei Zaharia
Hey Sandy, On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com) wrote: Hey All, A couple questions came up about shared variables recently, and I wanted to confirm my understanding and update the doc to be a little more clear. *Broadcast variables* Now that tasks data

[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-19 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14141382#comment-14141382 ] Matei Zaharia commented on SPARK-3129: -- So Hari, what is the maximum sustainable rate

[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-09-18 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139163#comment-14139163 ] Matei Zaharia commented on SPARK-2593: -- Sure, it would be great to do

Re: paging through an RDD that's too large to collect() all at once

2014-09-18 Thread Matei Zaharia
Hey Dave, try out RDD.toLocalIterator -- it gives you an iterator that reads one RDD partition at a time. Scala iterators also have methods like grouped() that let you get fixed-size groups. Matei On September 18, 2014 at 7:58:34 PM, dave-anderson (david.ander...@pobox.com) wrote: I have an
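The combination described above can be sketched in plain Scala; with Spark, the iterator would come from `rdd.toLocalIterator` (which streams one partition at a time to the driver), but the paging itself is just the standard iterator API:

```scala
// Page through a large sequence one fixed-size group at a time.
// In Spark, replace the stand-in iterator below with
// rdd.toLocalIterator, which pulls one RDD partition at a time.
object PagingDemo {
  def pages[T](it: Iterator[T], pageSize: Int): Iterator[List[T]] =
    it.grouped(pageSize).map(_.toList)

  def main(args: Array[String]): Unit = {
    val it = (1 to 10).iterator   // stand-in for rdd.toLocalIterator
    println(pages(it, 4).toList)  // List(List(1, 2, 3, 4), List(5, 6, 7, 8), List(9, 10))
  }
}
```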

[jira] [Commented] (SPARK-3530) Pipeline and Parameters

2014-09-17 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137719#comment-14137719 ] Matei Zaharia commented on SPARK-3530: -- To comment on the versioning stuff here

[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-17 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138281#comment-14138281 ] Matei Zaharia commented on SPARK-3129: -- Great, it will be nice to see how fast

[jira] [Updated] (SPARK-2620) case class cannot be used as key for reduce

2014-09-17 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2620: - Affects Version/s: 1.1.0 case class cannot be used as key for reduce

[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-09-17 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138390#comment-14138390 ] Matei Zaharia commented on SPARK-2593: -- The reason that we don't want to expose Akka

[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-09-17 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138402#comment-14138402 ] Matei Zaharia commented on SPARK-2593: -- BTW doing this for the ActorReceiver

Re: Short Circuit Local Reads

2014-09-17 Thread Matei Zaharia
I'm pretty sure it does help, though I don't have any numbers for it. In any case, Spark will automatically benefit from this if you link it to a version of HDFS that contains this. Matei On September 17, 2014 at 5:15:47 AM, Gary Malouf (malouf.g...@gmail.com) wrote: Cloudera had a blog post

Re: Spark as a Library

2014-09-16 Thread Matei Zaharia
If you want to run the computation on just one machine (using Spark's local mode), it can probably run in a container. Otherwise you can create a SparkContext there and connect it to a cluster outside. Note that I haven't tried this though, so the security policies of the container might be too

Re: NullWritable not serializable

2014-09-15 Thread Matei Zaharia
.count(). As you can see, count() does not need to serialize and ship data while the other three methods do. Do you recall any difference between spark 1.0 and 1.1 that might cause this problem? Thanks, Du From: Matei Zaharia matei.zaha...@gmail.com Date: Friday, September 12, 2014 at 9:10 PM

Re: scala 2.11?

2014-09-15 Thread Matei Zaharia
Scala 2.11 work is under way in open pull requests though, so hopefully it will be in soon. Matei On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) wrote: ah...thanks! On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra m...@clearstorydata.com wrote: No, not yet.  Spark SQL is

Re: scala 2.11?

2014-09-15 Thread Matei Zaharia
at the earliest. On Mon, Sep 15, 2014 at 12:11 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Scala 2.11 work is under way in open pull requests though, so hopefully it will be in soon. Matei On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) wrote: ah...thanks! On Mon, Sep 15

Re: Does Spark always wait for stragglers to finish running?

2014-09-15 Thread Matei Zaharia
It's true that it does not send a kill command right now -- we should probably add that. This code was written before tasks were killable AFAIK. However, the *job* should still finish while a speculative task is running as far as I know, and it will just leave that task behind. Matei On

Re: Complexity/Efficiency of SortByKey

2014-09-15 Thread Matei Zaharia
sortByKey is indeed O(n log n), it's a first pass to figure out even-sized partitions (by sampling the RDD), then a second pass to do a distributed merge-sort (first partition the data on each machine, then run a reduce phase that merges the data for each partition). The point where it becomes
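The two passes can be modeled in a few lines of plain Scala (a toy, single-machine sketch of the idea, not Spark's actual implementation; for simplicity it sorts the full key set to pick split points, where Spark only samples):

```scala
// Toy model of sortByKey's two passes: pick range boundaries from the
// keys, bucket each key into its range ("partition"), then sort each
// bucket. Concatenating the buckets yields a totally ordered result.
object RangeSortSketch {
  // Pass 1: choose (numPartitions - 1) split points. Spark estimates
  // these from a sample; here we just take exact quantiles.
  def splitPoints(keys: Seq[Int], numPartitions: Int): Seq[Int] = {
    val sorted = keys.sorted
    (1 until numPartitions).map(i => sorted(i * sorted.size / numPartitions))
  }

  // Pass 2: range-partition, then sort within each partition.
  def rangeSort(keys: Seq[Int], numPartitions: Int): Seq[List[Int]] = {
    val splits = splitPoints(keys, numPartitions)
    val buckets = Array.fill(numPartitions)(List.newBuilder[Int])
    for (k <- keys) {
      val p = splits.indexWhere(k < _) match {
        case -1 => numPartitions - 1  // above every split: last partition
        case i  => i
      }
      buckets(p) += k
    }
    buckets.map(_.result().sorted).toSeq
  }

  def main(args: Array[String]): Unit = {
    val parts = rangeSort(Seq(5, 3, 8, 1, 9, 2, 7), 2)
    println(parts.map(_.mkString(",")).mkString(" | "))  // 1,2,3 | 5,7,8,9
  }
}
```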

[jira] [Commented] (SPARK-1449) Please delete old releases from mirroring system

2014-09-14 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1412#comment-1412 ] Matei Zaharia commented on SPARK-1449: -- Hey folks, sorry for the delay -- will look

[jira] [Updated] (SPARK-1449) Please delete old releases from mirroring system

2014-09-14 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1449: - Assignee: Patrick Wendell Please delete old releases from mirroring system

Re: compiling spark source code

2014-09-14 Thread Matei Zaharia
I've seen the file name too long error when compiling on an encrypted Linux file system -- some of them have a limit on file name lengths. If you're on Linux, can you try compiling inside /tmp instead? Matei On September 13, 2014 at 10:03:14 PM, Yin Huai (huaiyin@gmail.com) wrote: Can you

Re: NullWritable not serializable

2014-09-12 Thread Matei Zaharia
Hi Du, I don't think NullWritable has ever been serializable, so you must be doing something differently from your previous program. In this case though, just use a map() to turn your Writables to serializable types (e.g. null and String). Matei On September 12, 2014 at 8:48:36 PM, Du Li

Re: Announcing Spark 1.1.0!

2014-09-11 Thread Matei Zaharia
Thanks to everyone who contributed to implementing and testing this release! Matei On September 11, 2014 at 11:52:43 PM, Tim Smith (secs...@gmail.com) wrote: Thanks for all the good work. Very excited about seeing more features and better stability in the framework. On Thu, Sep 11, 2014 at

[jira] [Assigned] (SPARK-2048) Optimizations to CPU usage of external spilling code

2014-09-08 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-2048: Assignee: Matei Zaharia Optimizations to CPU usage of external spilling code

[jira] [Resolved] (SPARK-2048) Optimizations to CPU usage of external spilling code

2014-09-08 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2048. -- Resolution: Fixed Optimizations to CPU usage of external spilling code

[jira] [Commented] (SPARK-2048) Optimizations to CPU usage of external spilling code

2014-09-08 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125247#comment-14125247 ] Matei Zaharia commented on SPARK-2048: -- Yeah, sounds good, thanks for pointing

[jira] [Resolved] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-08 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2978. -- Resolution: Fixed Fix Version/s: 1.2.0 Provide an MR-style shuffle transformation

[jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-08 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2978: - Assignee: Sandy Ryza Provide an MR-style shuffle transformation

[jira] [Updated] (SPARK-3444) Provide a way to easily change the log level in the Spark shell while running

2014-09-08 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3444: - Assignee: Holden Karau (was: Holden Karau) Provide a way to easily change the log level

[jira] [Updated] (SPARK-3444) Provide a way to easily change the log level in the Spark shell while running

2014-09-08 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3444: - Assignee: Holden Karau Provide a way to easily change the log level in the Spark shell while

[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle

2014-09-08 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126175#comment-14126175 ] Matei Zaharia commented on SPARK-3441: -- I agree that we should have more of a doc

[jira] [Resolved] (SPARK-3394) TakeOrdered crashes when limit is 0

2014-09-07 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3394. -- Resolution: Fixed Fix Version/s: 1.0.3 1.2.0 1.1.1

[jira] [Updated] (SPARK-3394) TakeOrdered crashes when limit is 0

2014-09-07 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3394: - Component/s: Spark Core TakeOrdered crashes when limit is 0

[jira] [Resolved] (SPARK-3353) Stage id monotonicity (parent stage should have lower stage id)

2014-09-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3353. -- Resolution: Fixed Fix Version/s: 1.2.0 Stage id monotonicity (parent stage should have

[jira] [Updated] (SPARK-3211) .take() is OOM-prone when there are empty partitions

2014-09-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3211: - Target Version/s: 1.1.1, 1.2.0 .take() is OOM-prone when there are empty partitions

[jira] [Updated] (SPARK-3211) .take() is OOM-prone when there are empty partitions

2014-09-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3211: - Assignee: Andrew Ash .take() is OOM-prone when there are empty partitions

[jira] [Resolved] (SPARK-3211) .take() is OOM-prone when there are empty partitions

2014-09-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3211. -- Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 .take() is OOM-prone

[jira] [Commented] (SPARK-640) Update Hadoop 1 version to 1.1.0 (especially on AMIs)

2014-09-04 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121883#comment-14121883 ] Matei Zaharia commented on SPARK-640: - [~pwendell] what is our Hadoop 1 version on AMIs

Re: pandas-like dataframe in spark

2014-09-04 Thread Matei Zaharia
Hi Mohit, This looks pretty interesting, but just a note on the implementation -- it might be worthwhile to try doing this on top of Spark SQL SchemaRDDs. The reason is that SchemaRDDs already have an efficient in-memory representation (columnar storage), and can be read from a variety of data
