Re: Building Spark against Scala 2.10.1 virtualized

2014-06-05 Thread Matei Zaharia
You can modify project/SparkBuild.scala and build Spark with sbt instead of Maven. On Jun 5, 2014, at 12:36 PM, Meisam Fathi meisam.fa...@gmail.com wrote: Hi community, How should I change sbt to compile spark core with a different version of Scala? I see maven pom files define

Re: reuse hadoop code in Spark

2014-06-05 Thread Matei Zaharia
in java and port it into Spark? Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center http://researcher.ibm.com/person/us-wtan From:Matei Zaharia matei.zaha...@gmail.com To:user

Re: Join : Giving incorrect result

2014-06-05 Thread Matei Zaharia
, June 5, 2014 1:35 AM, Matei Zaharia matei.zaha...@gmail.com wrote: If this isn’t the problem, it would be great if you can post the code for the program. Matei On Jun 4, 2014, at 12:58 PM, Xu (Simon) Chen xche...@gmail.com wrote: Maybe your two workers have different assembly jar

[jira] [Created] (SPARK-2013) Add Python pickleFile to programming guide

2014-06-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2013: Summary: Add Python pickleFile to programming guide Key: SPARK-2013 URL: https://issues.apache.org/jira/browse/SPARK-2013 Project: Spark Issue Type

[jira] [Created] (SPARK-2014) Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default

2014-06-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2014: Summary: Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default Key: SPARK-2014 URL: https://issues.apache.org/jira/browse/SPARK-2014 Project: Spark

[jira] [Updated] (SPARK-2013) Add Python pickleFile to programming guide

2014-06-04 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2013: - Assignee: Kan Zhang Add Python pickleFile to programming guide

[jira] [Updated] (SPARK-1912) Compression memory issue during reduce

2014-06-04 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1912: - Target Version/s: 0.9.2, 1.0.1, 1.1.0 (was: 0.9.2, 1.0.1) Compression memory issue during

[jira] [Updated] (SPARK-1912) Compression memory issue during reduce

2014-06-04 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1912: - Target Version/s: 0.9.2, 1.0.1 Compression memory issue during reduce

[jira] [Created] (SPARK-2024) Add saveAsSequenceFile to PySpark

2014-06-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2024: Summary: Add saveAsSequenceFile to PySpark Key: SPARK-2024 URL: https://issues.apache.org/jira/browse/SPARK-2024 Project: Spark Issue Type: New Feature

Re: Join : Giving incorrect result

2014-06-04 Thread Matei Zaharia
If this isn’t the problem, it would be great if you can post the code for the program. Matei On Jun 4, 2014, at 12:58 PM, Xu (Simon) Chen xche...@gmail.com wrote: Maybe your two workers have different assembly jar files? I just ran into a similar problem that my spark-shell is using a

Re: reuse hadoop code in Spark

2014-06-04 Thread Matei Zaharia
Yes, you can write some glue in Spark to call these. Some functions to look at: - SparkContext.hadoopRDD lets you create an input RDD from an existing JobConf configured by Hadoop (including InputFormat, paths, etc) - RDD.mapPartitions lets you operate on all the values in one partition (block)
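The per-partition pattern described in this reply can be sketched in plain Python without a cluster. This is only a model of the idea, not Spark's implementation: `map_partitions`, `legacy_record_parser`, and `parse_partition` are hypothetical names, and the nested lists stand in for the partitions of an RDD.

```python
def map_partitions(partitions, func):
    """Apply func once per partition; func takes an iterator and returns an iterable."""
    return [list(func(iter(partition))) for partition in partitions]


def legacy_record_parser(line):
    """Hypothetical stand-in for existing per-record logic ported from a Hadoop job."""
    return line.strip().upper()


def parse_partition(records):
    # Any per-partition setup (building a parser, opening a connection)
    # would run once here, before the records are consumed.
    for record in records:
        yield legacy_record_parser(record)


# Two "partitions" of raw input lines.
partitions = [[" a ", "b"], ["c "]]
result = map_partitions(partitions, parse_partition)
```

With a real RDD, a function shaped like `parse_partition` is what would be passed to `mapPartitions`; the point of the sketch is that existing per-record code can be wrapped rather than rewritten.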

Re: Better line number hints for logging?

2014-06-04 Thread Matei Zaharia
than just one line? (Of course you would have to click to expand it.) On Wed, Jun 4, 2014 at 2:38 AM, John Salvatier jsalvat...@gmail.com wrote: Ok, I will probably open a Jira. On Tue, Jun 3, 2014 at 5:29 PM, Matei Zaharia matei.zaha...@gmail.com wrote: You can use RDD.setName to give

Re: pyspark join crash

2014-06-04 Thread Matei Zaharia
In PySpark, the data processed by each reduce task needs to fit in memory within the Python process, so you should use more tasks to process this dataset. Data is spilled to disk across tasks. I’ve created https://issues.apache.org/jira/browse/SPARK-2021 to track this — it’s something we’ve
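To see why more tasks help here, the following plain-Python sketch counts how many records land on the busiest reduce task under simple hash partitioning. It is an illustration under stated assumptions, not PySpark's actual partitioner: `largest_task_load` is a hypothetical helper, and the integer keys are made-up data.

```python
from collections import Counter


def largest_task_load(keys, num_tasks):
    """Return the number of records assigned to the busiest of num_tasks tasks."""
    loads = Counter(hash(key) % num_tasks for key in keys)
    return max(loads.values(), default=0)


# 1000 distinct integer keys, one record each (illustrative data only).
keys = range(1000)

load_with_10_tasks = largest_task_load(keys, 10)
load_with_100_tasks = largest_task_load(keys, 100)
```

Raising the task count from 10 to 100 shrinks the largest per-task share tenfold in this model, which is the effect the reply relies on to make each reduce task's data fit in the Python process.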

Re: How can I dispose an Accumulator?

2014-06-04 Thread Matei Zaharia
All of these are disposed of automatically if you stop the context or exit the program. Matei On Jun 4, 2014, at 2:22 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: Will the broadcast variables be disposed automatically if the context is stopped, or do I still need to unpersist()?

Re: pyspark join crash

2014-06-04 Thread Matei Zaharia
On Wed, Jun 4, 2014 at 1:42 PM, Matei Zaharia matei.zaha...@gmail.com wrote: In PySpark, the data processed by each reduce task needs to fit in memory within the Python process, so you should use more tasks to process this dataset. Data is spilled to disk across tasks. I’ve created https

Re: Why Scala?

2014-06-04 Thread Matei Zaharia
to include Python APIs in Spark Streaming? Any time frame on this? Thanks! John On Thu, May 29, 2014 at 4:19 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Quite a few people ask this question and the answer is pretty simple. When we started Spark, we had two goals — we wanted to work

Re: Logistic Regression MLLib Slow

2014-06-04 Thread Matei Zaharia
Are you using the logistic_regression.py in examples/src/main/python or examples/src/main/python/mllib? The first one is an example of writing logistic regression by hand and won’t be as efficient as the MLlib one. I suggest trying the MLlib one. You may also want to check how many iterations

Re: Logistic Regression MLLib Slow

2014-06-04 Thread Matei Zaharia
. The MLLib version of logistic regression doesn't seem to use all the cores on my machine. Regards, Krishna On Wed, Jun 4, 2014 at 6:47 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Are you using the logistic_regression.py in examples/src/main/python or examples/src/main

[jira] [Commented] (SPARK-1790) Update EC2 scripts to support r3 instance types

2014-06-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017001#comment-14017001 ] Matei Zaharia commented on SPARK-1790: -- It's fine to skip the check right now; I

[jira] [Updated] (SPARK-1942) Stop clearing spark.driver.port in unit tests

2014-06-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1942: - Fix Version/s: 1.1.0 Stop clearing spark.driver.port in unit tests

[jira] [Resolved] (SPARK-1912) Compression memory issue during reduce

2014-06-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1912. -- Resolution: Fixed Compression memory issue during reduce

[jira] [Updated] (SPARK-1992) Support for Pivotal HD in the Maven build

2014-06-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1992: - Assignee: Christian Tzolov Support for Pivotal HD in the Maven build

[jira] [Updated] (SPARK-1992) Support for Pivotal HD in the Maven build

2014-06-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1992: - Fix Version/s: 1.0.1 Support for Pivotal HD in the Maven build

[jira] [Updated] (SPARK-1992) Support for Pivotal HD in the Maven build

2014-06-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1992: - Issue Type: Improvement (was: Bug) Support for Pivotal HD in the Maven build

[jira] [Updated] (SPARK-1992) Support for Pivotal HD in the Maven build

2014-06-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1992: - Fix Version/s: 1.1.0 Support for Pivotal HD in the Maven build

[jira] [Resolved] (SPARK-1468) The hash method used by partitionBy in Pyspark doesn't deal with None correctly.

2014-06-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1468. -- Resolution: Fixed The hash method used by partitionBy in Pyspark doesn't deal with None

[jira] [Resolved] (SPARK-1161) Add saveAsObjectFile and SparkContext.objectFile in Python

2014-06-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1161. -- Resolution: Fixed Merged this in -- thanks Kan! Add saveAsObjectFile

[jira] [Updated] (SPARK-1161) Add saveAsObjectFile and SparkContext.objectFile in Python

2014-06-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1161: - Fix Version/s: 1.1.0 Add saveAsObjectFile and SparkContext.objectFile in Python

Re: Add my JIRA username (hsaputra) to Spark's contributor's list

2014-06-03 Thread Matei Zaharia
Done. Looks like this was lost in the JIRA import. Matei On Jun 3, 2014, at 11:33 AM, Henry Saputra henry.sapu...@gmail.com wrote: Hi, Could someone with right karma kindly add my username (hsaputra) to Spark's contributor list? I was added before but somehow now I can no longer assign

Re: collectAsMap doesn't return a multiMap?

2014-06-03 Thread Matei Zaharia
Yup, it’s meant to be just a Map. You should probably use collect() and build a multimap instead if you’d like that. Matei On Jun 3, 2014, at 2:08 PM, Doris Xin doris.s@gmail.com wrote: Hey guys, Just wanted to check real quick if collectAsMap was by design not to return a multimap
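A minimal sketch of the suggestion above, in plain Python. The list `collected` stands in for what `collect()` would return on a pair RDD, and `to_multimap` is a hypothetical helper name. A plain map can hold only one value per key, so duplicate keys have to be grouped explicitly.

```python
from collections import defaultdict


def to_multimap(pairs):
    """Group (key, value) pairs, keeping every value seen for each key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Stand-in for the list returned by collect() on a pair RDD.
collected = [("a", 1), ("b", 2), ("a", 3)]

as_map = dict(collected)              # one value per key; later pairs overwrite earlier ones
as_multimap = to_multimap(collected)  # every value preserved, in arrival order
```

Grouping on the driver like this is only reasonable when the collected data is small enough to fit in the driver's memory.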

Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread Matei Zaharia
Yeah unfortunately Hadoop 2 requires these binaries on Windows. Hadoop 1 runs just fine without them. Matei On Jun 3, 2014, at 10:33 AM, Sean Owen so...@cloudera.com wrote: I'd try the internet / SO first -- these are actually generic Hadoop-related issues. Here I think you don't have

Re: Better line number hints for logging?

2014-06-03 Thread Matei Zaharia
You can use RDD.setName to give it a name. There’s also a creationSite field that is private[spark] — we may want to add a public setter for that later. If the name isn’t enough and you’d like this, please open a JIRA issue for it. Matei On Jun 3, 2014, at 5:22 PM, John Salvatier

Re: Invalid Class Exception

2014-06-03 Thread Matei Zaharia
What Java version do you have, and how did you get Spark (did you build it yourself by any chance or download a pre-built one)? If you build Spark yourself you need to do it with Java 6 — it’s a known issue because of the way Java 6 and 7 package JAR files. But I haven’t seen it result in this

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-03 Thread Matei Zaharia
Ghost, it's the dream language we've theorized about for years! I hadn't realized! Indeed, glad you’re enjoying it. Matei On Mon, Jun 2, 2014 at 12:05 PM, Matei Zaharia matei.zaha...@gmail.com wrote: FYI, I opened https://issues.apache.org/jira/browse/SPARK-1990 to track this. Matei

Re: Upgradation to Spark 1.0.0

2014-06-03 Thread Matei Zaharia
You can copy your configuration from the old one. I’d suggest just downloading it to a different location on each node first for testing, then you can delete the old one if things work. On Jun 3, 2014, at 12:38 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi , I am currently using

[jira] [Created] (SPARK-1996) Remove use of special Maven repo for Akka

2014-06-02 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1996: Summary: Remove use of special Maven repo for Akka Key: SPARK-1996 URL: https://issues.apache.org/jira/browse/SPARK-1996 Project: Spark Issue Type

Re: Eclipse Scala IDE/Scala test and Wiki

2014-06-02 Thread Matei Zaharia
Madhu, can you send me your Wiki username? (Sending it just to me is fine.) I can add you to the list to edit it. Matei On Jun 2, 2014, at 6:27 PM, Reynold Xin r...@databricks.com wrote: I tried but didn't find where I could add you. You probably need Matei to help out with this. On

Re: SecurityException when running tests with Spark 1.0.0

2014-06-02 Thread Matei Zaharia
You can just use the Maven build for now, even for Spark 1.0.0. Matei On Jun 2, 2014, at 5:30 PM, Mohit Nayak wiza...@gmail.com wrote: Hey, Yup that fixed it. Thanks so much! Is this the only solution, or could this be resolved in future versions of Spark ? On Mon, Jun 2, 2014 at

[jira] [Created] (SPARK-1989) Exit executors faster if they get into a cycle of heavy GC

2014-06-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1989: Summary: Exit executors faster if they get into a cycle of heavy GC Key: SPARK-1989 URL: https://issues.apache.org/jira/browse/SPARK-1989 Project: Spark

[jira] [Created] (SPARK-1990) spark-ec2 should only need Python 2.6, not 2.7

2014-06-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1990: Summary: spark-ec2 should only need Python 2.6, not 2.7 Key: SPARK-1990 URL: https://issues.apache.org/jira/browse/SPARK-1990 Project: Spark Issue Type

[jira] [Updated] (SPARK-1790) Update EC2 scripts to support r3 instance types

2014-06-01 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1790: - Fix Version/s: 1.0.1 Update EC2 scripts to support r3 instance types

[jira] [Commented] (SPARK-1990) spark-ec2 should only need Python 2.6, not 2.7

2014-06-01 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015146#comment-14015146 ] Matei Zaharia commented on SPARK-1990: -- BTW here is the first error this gets: {code

Re: ClassTag in Serializer in 1.0.0 makes non-scala callers sad panda

2014-06-01 Thread Matei Zaharia
Why do you need to call Serializer from your own program? It’s an internal developer API so ideally it would only be called to extend Spark. Are you looking to implement a custom Serializer? Matei On Jun 1, 2014, at 3:40 PM, Soren Macbeth so...@yieldbot.com wrote:

Re: ClassTag in Serializer in 1.0.0 makes non-scala callers sad panda

2014-06-01 Thread Matei Zaharia
BTW passing a ClassTag tells the Serializer what the type of object being serialized is when you compile your program, which will allow for more efficient serializers (especially on streams). Matei On Jun 1, 2014, at 4:24 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Why do you need

Re: ClassTag in Serializer in 1.0.0 makes non-scala callers sad panda

2014-06-01 Thread Matei Zaharia
it by making ClassTag object in clojure, but it's less than ideal. On Sun, Jun 1, 2014 at 4:25 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW passing a ClassTag tells the Serializer what the type of object being serialized is when you compile your program, which will allow for more

Re: ClassTag in Serializer in 1.0.0 makes non-scala callers sad panda

2014-06-01 Thread Matei Zaharia
, 2014 at 5:10 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Ah, got it. In general it will always be safe to pass the ClassTag for java.lang.Object here — this is what our Java API does to say that type info is not known. So you can always pass that. Look at the Java code for how to get

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Matei Zaharia
More specifically with the -a flag, you *can* set your own AMI, but you’ll need to base it off ours. This is because spark-ec2 assumes that some packages (e.g. java, Python 2.6) are already available on the AMI. Matei On Jun 1, 2014, at 11:01 AM, Patrick Wendell pwend...@gmail.com wrote: Hey

Re: Trouble with EC2

2014-06-01 Thread Matei Zaharia
1, 2014, at 3:11 PM, PJ$ p...@chickenandwaffl.es wrote: Running on a few m3.larges with the ami-848a6eec image (debian 7). Haven't gotten any further. No clue what's wrong. I'd really appreciate any guidance y'all could offer. Best, PJ$ On Sat, May 31, 2014 at 1:40 PM, Matei Zaharia

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Matei Zaharia
FYI, I opened https://issues.apache.org/jira/browse/SPARK-1990 to track this. Matei On Jun 1, 2014, at 6:14 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Sort of.. there were two separate issues, but both related to AWS.. I've sorted the confusion about the Master/Worker AMI ... use

[jira] [Resolved] (SPARK-1917) PySpark fails to import functions from {{scipy.special}}

2014-05-31 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1917. -- Resolution: Fixed PySpark fails to import functions from {{scipy.special

[jira] [Updated] (SPARK-1917) PySpark fails to import functions from {{scipy.special}}

2014-05-31 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1917: - Assignee: Uri Laserson PySpark fails to import functions from {{scipy.special

Re: Trouble with EC2

2014-05-31 Thread Matei Zaharia
What instance types did you launch on? Sometimes you also get a bad individual machine from EC2. It might help to remove the node it’s complaining about from the conf/slaves file. Matei On May 30, 2014, at 11:18 AM, PJ$ p...@chickenandwaffl.es wrote: Hey Folks, I'm really having quite a

[jira] [Updated] (MESOS-53) Master should make offers even for machines with no free memory

2014-05-30 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/MESOS-53?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated MESOS-53: --- Assignee: (was: Matei Zaharia) Master should make offers even for machines with no free memory

[jira] [Closed] (SPARK-1784) Add a partitioner which partitions an RDD with each partition having specified # of keys

2014-05-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-1784. Resolution: Invalid Fix Version/s: (was: 1.0.0) Add a partitioner which partitions

[jira] [Updated] (SPARK-1811) Support resizable output buffer for kryo serializer

2014-05-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1811: - Assignee: Koert Kuipers Support resizable output buffer for kryo serializer

[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012520#comment-14012520 ] Matei Zaharia commented on SPARK-1518: -- Sorry, I'm still not sure I understand what

[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012651#comment-14012651 ] Matei Zaharia commented on SPARK-1518: -- Okay, got it. But this only applies to you

[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012655#comment-14012655 ] Matei Zaharia commented on SPARK-1518: -- BTW one other thing is that in 1.0, you can

Re: Suggestion: RDD cache depth

2014-05-29 Thread Matei Zaharia
This is a pretty cool idea — instead of cache depth I’d call it something like reference counting. Would you mind opening a JIRA issue about it? The issue of really composing together libraries that use RDDs nicely isn’t fully explored, but this is certainly one thing that would help with it.

Re: [RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Matei Zaharia
are the totals: +1: (13 votes) Matei Zaharia* Mark Hamstra* Holden Karau Nick Pentreath* Will Benton Henry Saputra Sean McNamara* Xiangrui Meng* Andy Konwinski* Krishna Sankar Kevin Markey Patrick Wendell* Tathagata Das* 0: (1 vote) Ankur Dave* -1: (0 vote) Please hold off

Re: Spark hook to create external process

2014-05-29 Thread Matei Zaharia
Hi Anand, This is probably already handled by the RDD.pipe() operation. It will spawn a process and let you feed data to it through its stdin and read data through stdout. Matei On May 29, 2014, at 9:39 AM, ansriniv ansri...@gmail.com wrote: I have a requirement where for every Spark
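The behaviour described here (spawn a process, feed it records on stdin, read results from stdout) can be modelled for a single partition with the standard library alone. This is a sketch of the idea rather than Spark's implementation: `pipe_partition` and `UPPERCASE_COMMAND` are hypothetical names, and the child process is just a small Python one-liner chosen so the example runs anywhere.

```python
import subprocess
import sys


def pipe_partition(lines, command):
    """Send each record to the command's stdin as one line; return its stdout lines."""
    result = subprocess.run(
        command,
        input="".join(line + "\n" for line in lines),
        capture_output=True,
        text=True,
        check=True,  # raise if the external process exits with an error
    )
    return result.stdout.splitlines()


# Illustrative external process: upper-cases every line it reads.
UPPERCASE_COMMAND = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())",
]

piped = pipe_partition(["spark", "pipe"], UPPERCASE_COMMAND)
```

In this model one process is started per call, which mirrors the one-process-per-partition shape of the operation; the external program only needs to speak line-oriented text.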

Re: Driver OOM while using reduceByKey

2014-05-29 Thread Matei Zaharia
That hash map is just a list of where each task ran, it’s not the actual data. How many map and reduce tasks do you have? Maybe you need to give the driver a bit more memory, or use fewer tasks (e.g. do reduceByKey(_ + _, 100) to use only 100 tasks). Matei On May 29, 2014, at 2:03 AM, haitao
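The effect of the second argument in `reduceByKey(_ + _, 100)` can be illustrated with a plain-Python model of a hash-partitioned reduce. This is a sketch, not Spark's shuffle: `reduce_by_key` is a hypothetical function and the sample pairs are made up. The task count changes how the work is split, not the final result.

```python
def reduce_by_key(pairs, func, num_tasks):
    """Combine values per key, routing each key to one of num_tasks buckets."""
    buckets = [dict() for _ in range(num_tasks)]
    for key, value in pairs:
        # Every occurrence of a key hashes to the same bucket ("reduce task").
        bucket = buckets[hash(key) % num_tasks]
        bucket[key] = func(bucket[key], value) if key in bucket else value
    merged = {}
    for bucket in buckets:
        merged.update(bucket)  # buckets hold disjoint keys, so nothing is overwritten
    return merged


pairs = [("a", 1), ("b", 2), ("a", 3)]
with_100_tasks = reduce_by_key(pairs, lambda x, y: x + y, 100)
with_1_task = reduce_by_key(pairs, lambda x, y: x + y, 1)
```

Fewer tasks means less per-task bookkeeping for the driver to track, at the cost of more data handled by each task.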

Re: Why Scala?

2014-05-29 Thread Matei Zaharia
Quite a few people ask this question and the answer is pretty simple. When we started Spark, we had two goals — we wanted to work with the Hadoop ecosystem, which is JVM-based, and we wanted a concise programming interface similar to Microsoft’s DryadLINQ (the first language-integrated big data

Re: Shuffle file consolidation

2014-05-29 Thread Matei Zaharia
It can be set in an individual application. Consolidation had some issues on ext3 as mentioned there, though we might enable it by default in the future because other optimizations now made it perform on par with the non-consolidation version. It also had some bugs in 0.9.0 so I’d suggest at

[jira] [Created] (SPARK-1945) Add full Java examples in MLlib docs

2014-05-28 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1945: Summary: Add full Java examples in MLlib docs Key: SPARK-1945 URL: https://issues.apache.org/jira/browse/SPARK-1945 Project: Spark Issue Type: Sub-task

[jira] [Resolved] (SPARK-1936) Add apache header and remove author tags

2014-05-28 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1936. -- Resolution: Won't Fix We should not change these files' license headers because they're files

[jira] [Commented] (SPARK-1790) Update EC2 scripts to support r3 instance types

2014-05-28 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011548#comment-14011548 ] Matei Zaharia commented on SPARK-1790: -- Thanks Sujeet! Just post here when you have

[jira] [Updated] (SPARK-1790) Update EC2 scripts to support r3 instance types

2014-05-28 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1790: - Labels: Starter (was: starter) Update EC2 scripts to support r3 instance types

[jira] [Commented] (SPARK-1952) slf4j version conflicts with pig

2014-05-28 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011711#comment-14011711 ] Matei Zaharia commented on SPARK-1952: -- Ryan, do you know what SLF4J version Pig

[jira] [Resolved] (SPARK-1712) ParallelCollectionRDD operations hanging forever without any error messages

2014-05-28 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1712. -- Resolution: Fixed ParallelCollectionRDD operations hanging forever without any error messages

[jira] [Updated] (SPARK-1712) ParallelCollectionRDD operations hanging forever without any error messages

2014-05-28 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1712: - Priority: Major (was: Blocker) ParallelCollectionRDD operations hanging forever without any

[jira] [Updated] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-05-28 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1817: - Priority: Minor (was: Blocker) RDD zip erroneous when partitions do not divide RDD count

[jira] [Updated] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-05-28 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1817: - Priority: Major (was: Minor) RDD zip erroneous when partitions do not divide RDD count

[jira] [Updated] (SPARK-1712) ParallelCollectionRDD operations hanging forever without any error messages

2014-05-28 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1712: - Fix Version/s: 1.0.1 ParallelCollectionRDD operations hanging forever without any error

[jira] [Commented] (SPARK-1712) ParallelCollectionRDD operations hanging forever without any error messages

2014-05-28 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011830#comment-14011830 ] Matei Zaharia commented on SPARK-1712: -- Merged the frame size check into 0.9.2

[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-28 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011942#comment-14011942 ] Matei Zaharia commented on SPARK-1518: -- Sean, the model for linking to Hadoop has

[jira] [Updated] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution

2014-05-28 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1112: - Priority: Critical (was: Blocker) When spark.akka.frameSize > 10, task results bigger than

[jira] [Commented] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution

2014-05-28 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011978#comment-14011978 ] Matei Zaharia commented on SPARK-1112: -- I'm curious, why did you want to make

Re: Python, Spark and HBase

2014-05-28 Thread Matei Zaharia
It sounds like you made a typo in the code — perhaps you’re trying to call self._jvm.PythonRDDnewAPIHadoopFile instead of self._jvm.PythonRDD.newAPIHadoopFile? There should be a dot before the new. Matei On May 28, 2014, at 5:25 PM, twizansk twiza...@gmail.com wrote: Hi Nick, I finally

Re: Checking spark cache percentage programatically. And how to clear cache.

2014-05-28 Thread Matei Zaharia
You can remove cached RDDs by calling unpersist() on them. You can also use SparkContext.getRDDStorageInfo to get info on cache usage, though this is a developer API so it may change in future versions. We will add a standard API eventually but this is just very closely tied to framework

[jira] [Commented] (SPARK-1566) Consolidate the Spark Programming Guide with tabs for all languages

2014-05-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010496#comment-14010496 ] Matei Zaharia commented on SPARK-1566: -- https://github.com/apache/spark/pull/896

[jira] [Updated] (SPARK-1825) Windows Spark fails to work with Linux YARN

2014-05-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1825: - Fix Version/s: (was: 1.0.0) Windows Spark fails to work with Linux YARN

[jira] [Created] (SPARK-1942) Stop clearing spark.driver.port in unit tests

2014-05-27 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1942: Summary: Stop clearing spark.driver.port in unit tests Key: SPARK-1942 URL: https://issues.apache.org/jira/browse/SPARK-1942 Project: Spark Issue Type: Task

Re: About JIRA SPARK-1825

2014-05-27 Thread Matei Zaharia
Hei Taeyun, have you sent a pull request for this fix? We can review it there. It’s too late to merge anything but blockers for 1.0.0 but we can do it for 1.0.1 or 1.1, depending how big the patch is. Matei On May 27, 2014, at 5:25 PM, innowireless TaeYun Kim taeyun@innowireless.co.kr

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-26 Thread Matei Zaharia
+1 Tested on Mac OS X and Windows. Matei On May 26, 2014, at 7:38 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has a few important bug fixes on top of rc10: SPARK-1900 and SPARK-1918:

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-26 Thread Matei Zaharia
I think the question for me would be does this only happen when you call partitionBy, or always? And how common do you expect calls to partitionBy to be? If we can wait for 1.0.1 then I’d wait on this one. Matei On May 26, 2014, at 10:47 PM, Patrick Wendell pwend...@gmail.com wrote: Hey

[jira] [Assigned] (SPARK-1566) Consolidate the Spark Programming Guide with tabs for all languages

2014-05-25 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-1566: Assignee: Matei Zaharia Consolidate the Spark Programming Guide with tabs for all

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Matei Zaharia
+1 Tested it on both Windows and Mac OS X, with both Scala and Python. Confirmed that the issues in the previous RC were fixed. Matei On May 20, 2014, at 5:28 PM, Marcelo Vanzin van...@cloudera.com wrote: +1 (non-binding) I have: - checked signatures and checksums of the files - built

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Matei Zaharia
restarting the workers usually resolves this, but we have often seen workers disappear after a failed or killed job. If we see this occur again, I'll try and provide some logs. On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Which version is this with? I

Re: Python, Spark and HBase

2014-05-20 Thread Matei Zaharia
Unfortunately this is not yet possible. There’s a patch in progress posted here though: https://github.com/apache/spark/pull/455 — it would be great to get your feedback on it. Matei On May 20, 2014, at 4:21 PM, twizansk twiza...@gmail.com wrote: Hello, This seems like a basic question

[jira] [Commented] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1

2014-05-19 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001420#comment-14001420 ] Matei Zaharia commented on SPARK-1875: -- I see, it might be fine to just remove

[jira] [Created] (SPARK-1879) Default PermGen size too small when using Hadoop2 and Hive

2014-05-19 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1879: Summary: Default PermGen size too small when using Hadoop2 and Hive Key: SPARK-1879 URL: https://issues.apache.org/jira/browse/SPARK-1879 Project: Spark

[jira] [Assigned] (SPARK-1879) Default PermGen size too small when using Hadoop2 and Hive

2014-05-19 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-1879: Assignee: Matei Zaharia Default PermGen size too small when using Hadoop2 and Hive

[jira] [Comment Edited] (SPARK-1879) Default PermGen size too small when using Hadoop2 and Hive

2014-05-19 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001451#comment-14001451 ] Matei Zaharia edited comment on SPARK-1879 at 5/19/14 7:25 AM

[jira] [Commented] (SPARK-1879) Default PermGen size too small when using Hadoop2 and Hive

2014-05-19 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001451#comment-14001451 ] Matei Zaharia commented on SPARK-1879: -- BTW the warning on Java 8 is the following

[jira] [Commented] (SPARK-1879) Default PermGen size too small when using Hadoop2 and Hive

2014-05-19 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001461#comment-14001461 ] Matei Zaharia commented on SPARK-1879: -- https://github.com/apache/spark/pull/823

[jira] [Commented] (SPARK-1857) map() with lookup() causes exception

2014-05-19 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002275#comment-14002275 ] Matei Zaharia commented on SPARK-1857: -- The problem is that it's not currently

[jira] [Commented] (SPARK-1874) Clean up MLlib sample data

2014-05-19 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002369#comment-14002369 ] Matei Zaharia commented on SPARK-1874: -- Yes, cause there's other stuff in `data`. I

Re: queston about Spark repositories in GitHub

2014-05-19 Thread Matei Zaharia
“master” is where development happens, while branch-1.0, branch-0.9, etc are for maintenance releases in those versions. Most likely if you want to contribute you should use master. Some of the other named branches were for big features in the past, but none are actively used now. Matei On

Re: persist @ disk-only failing

2014-05-19 Thread Matei Zaharia
, May 19, 2014 at 1:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: What version is this with? We used to build each partition first before writing it out, but this was fixed a while back (0.9.1, but it may also be in 0.9.0). Matei On May 19, 2014, at 12:41 AM, Sai Prasanna
