Re: Logistic Regression MLLib Slow

2014-06-04 Thread Matei Zaharia
Are you using the logistic_regression.py in examples/src/main/python or examples/src/main/python/mllib? The first one is an example of writing logistic regression by hand and won’t be as efficient as the MLlib one. I suggest trying the MLlib one. You may also want to check how many iterations

Re: Logistic Regression MLLib Slow

2014-06-04 Thread Matei Zaharia
. The MLLib version of logistic regression doesn't seem to use all the cores on my machine. Regards, Krishna On Wed, Jun 4, 2014 at 6:47 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Are you using the logistic_regression.py in examples/src/main/python or examples/src/main

Re: reuse hadoop code in Spark

2014-06-05 Thread Matei Zaharia
in java and port it into Spark? Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center http://researcher.ibm.com/person/us-wtan From:Matei Zaharia matei.zaha...@gmail.com To:user

Re: Join : Giving incorrect result

2014-06-05 Thread Matei Zaharia
, June 5, 2014 1:35 AM, Matei Zaharia matei.zaha...@gmail.com wrote: If this isn’t the problem, it would be great if you can post the code for the program. Matei On Jun 4, 2014, at 12:58 PM, Xu (Simon) Chen xche...@gmail.com wrote: Maybe your two workers have different assembly jar

Re: How to enable fault-tolerance?

2014-06-09 Thread Matei Zaharia
If this is a useful feature for local mode, we should open a JIRA to document the setting or improve it (I’d prefer to add a spark.local.retries property instead of a special URL format). We initially disabled it for everything except unit tests because 90% of the time an exception in local

Re: Is spark context in local mode thread-safe?

2014-06-09 Thread Matei Zaharia
You currently can’t have multiple SparkContext objects in the same JVM, but within a SparkContext, all of the APIs are thread-safe so you can share that context between multiple threads. The other issue you’ll run into is that in each thread where you want to use Spark, you need to use
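
A minimal sketch of the pattern described above — one SparkContext shared by several threads, each submitting its own jobs (names, master URL, and data sizes here are hypothetical; assumes the Spark 1.x Scala API):

    import org.apache.spark.{SparkConf, SparkContext}

    object SharedContextExample {
      def main(args: Array[String]): Unit = {
        // One SparkContext per JVM; submitting jobs on it from several threads is safe.
        val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("shared-sc"))
        val data = sc.parallelize(1 to 1000000).cache()

        val threads = (1 to 4).map { i =>
          new Thread(new Runnable {
            def run(): Unit = {
              // Each thread runs its own jobs against the shared context.
              println("multiples of " + i + ": " + data.filter(_ % i == 0).count())
            }
          })
        }
        threads.foreach(_.start())
        threads.foreach(_.join())
        sc.stop()
      }
    }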

Re: Is spark context in local mode thread-safe?

2014-06-09 Thread Matei Zaharia
. If we can disable the UI http Server; it would be much simpler to handle than having two http containers to deal with. Chester On Monday, June 9, 2014 4:35 PM, Matei Zaharia matei.zaha...@gmail.com wrote: You currently can’t have multiple SparkContext objects in the same JVM

Re: How to specify executor memory in EC2 ?

2014-06-10 Thread Matei Zaharia
It might be that conf/spark-env.sh on EC2 is configured to set it to 512, and is overriding the application’s settings. Take a look in there and delete that line if possible. Matei On Jun 10, 2014, at 2:38 PM, Aliaksei Litouka aliaksei.lito...@gmail.com wrote: I am testing my application in

Re: Powered by Spark addition

2014-06-11 Thread Matei Zaharia
Alright, added you. Matei On Jun 11, 2014, at 1:28 PM, Derek Mansen de...@vistarmedia.com wrote: Hello, I was wondering if we could add our organization to the Powered by Spark page. The information is: Name: Vistar Media URL: www.vistarmedia.com Description: Location technology company

Re: When to use CombineByKey vs reduceByKey?

2014-06-11 Thread Matei Zaharia
combineByKey is designed for when your return type from the aggregation is different from the values being aggregated (e.g. you group together objects), and it should allow you to modify the leftmost argument of each function (mergeCombiners, mergeValue, etc) and return that instead of
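
A small sketch of that idea: the combiner type (a mutable buffer) differs from the value type (Int), and mergeValue/mergeCombiners mutate their left argument and return it instead of allocating new objects. The pair RDD contents are hypothetical; the extra import is needed for pair-RDD operations in Spark 1.x.

    import org.apache.spark.SparkContext._
    import scala.collection.mutable.ArrayBuffer

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val grouped = pairs.combineByKey(
      (v: Int) => ArrayBuffer(v),                                        // createCombiner: new return type
      (buf: ArrayBuffer[Int], v: Int) => { buf += v; buf },              // mergeValue: mutate left arg, return it
      (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => { b1 ++= b2; b1 }  // mergeCombiners: same idea
    )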

Re: How to specify executor memory in EC2 ?

2014-06-12 Thread Matei Zaharia
and will post it if I find it :) Thank you, anyway On Wed, Jun 11, 2014 at 12:19 AM, Matei Zaharia matei.zaha...@gmail.com wrote: It might be that conf/spark-env.sh on EC2 is configured to set it to 512, and is overriding the application’s settings. Take a look in there and delete

Re: guidance on simple unit testing with Spark

2014-06-13 Thread Matei Zaharia
You need to factor your program so that it’s not just a main(). This is not a Spark-specific issue, it’s about how you’d unit test any program in general. In this case, your main() creates a SparkContext, so you can’t pass one from outside, and your code has to read data from a file and write

Re: Is shuffle stable?

2014-06-14 Thread Matei Zaharia
The order is not guaranteed actually, only which keys end up in each partition. Reducers may fetch data from map tasks in an arbitrary order, depending on which ones are available first. If you’d like a specific order, you should sort each partition. Here you might be getting it because each

Re: pyspark serializer can't handle functions?

2014-06-16 Thread Matei Zaharia
It’s true that it can’t. You can try to use the CloudPickle library instead, which is what we use within PySpark to serialize functions (see python/pyspark/cloudpickle.py). However I’m also curious, why do you need an RDD of functions? Matei On Jun 15, 2014, at 4:49 PM, madeleine

Re: pyspark serializer can't handle functions?

2014-06-16 Thread Matei Zaharia
is that I'm using alternating minimization, so I'll be minimizing over the rows and columns of this matrix at alternating steps; hence I need to store both the matrix and its transpose to avoid data thrashing. On Mon, Jun 16, 2014 at 11:05 AM, Matei Zaharia [via Apache Spark User List] [hidden

Re: Un-serializable 3rd-party classes (Spark, Java)

2014-06-17 Thread Matei Zaharia
There are a few options: - Kryo might be able to serialize these objects out of the box, depending what’s inside them. Try turning it on as described at http://spark.apache.org/docs/latest/tuning.html. - If that doesn’t work, you can create your own “wrapper” objects that implement

Re: Spark is now available via Homebrew

2014-06-18 Thread Matei Zaharia
Interesting, does anyone know the people over there who set it up? It would be good if Apache itself could publish packages there, though I’m not sure what’s involved. Since Spark just depends on Java and Python it should be easy for us to update. Matei On Jun 18, 2014, at 1:37 PM, Nick

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Matei Zaharia
I was going to suggest the same thing :). On Jun 18, 2014, at 4:56 PM, Evan R. Sparks evan.spa...@gmail.com wrote: This looks like a job for SparkSQL! val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ case class MyRecord(country: String, name: String, age: Int,
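
A hedged reconstruction of that approach with hypothetical record fields and file paths, assuming Spark 1.1+ (where registerTempTable replaced registerAsTable); several aggregates come back from a single query over the data:

    case class MyRecord(country: String, name: String, age: Int, salary: Double)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._   // implicit conversion from RDD[MyRecord] to SchemaRDD

    val records = sc.textFile("hdfs:///data/people.csv")
      .map(_.split(","))
      .map(p => MyRecord(p(0), p(1), p(2).toInt, p(3).toDouble))
    records.registerTempTable("records")

    // Multiple aggregations computed in one pass over the table.
    sqlContext.sql(
      "SELECT country, COUNT(1), AVG(age), MAX(salary) FROM records GROUP BY country"
    ).collect().foreach(println)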

Re: MLLib inside Storm : silly or not ?

2014-06-19 Thread Matei Zaharia
You should be able to use many of the MLlib Model objects directly in Storm, if you save them out using Java serialization. The only one that won’t work is probably ALS, because it’s a distributed model. Otherwise, you will have to output them in your own format and write code for evaluating

Re: Powered by Spark addition

2014-06-21 Thread Matei Zaharia
customer targetting, accurate inventory and efficient analysis. Thanks! Best Regards, Sonal Nube Technologies On Thu, Jun 12, 2014 at 11:33 PM, Derek Mansen de...@vistarmedia.com wrote: Awesome, thank you! On Wed, Jun 11, 2014 at 6:53 PM, Matei Zaharia matei.zaha

Re: Spark 1.0: Unable to Read LZO Compressed File

2014-07-01 Thread Matei Zaharia
I’d suggest asking the IBM Hadoop folks, but my guess is that the library cannot be found in /opt/IHC/lib/native/Linux-amd64-64/. Or maybe if this exception is happening in your driver program, the driver program’s java.library.path doesn’t include this. (SPARK_LIBRARY_PATH from spark-env.sh

Re: java options for spark-1.0.0

2014-07-02 Thread Matei Zaharia
Try looking at the running processes with “ps” to see their full command line and see whether any options are different. It seems like in both cases, your young generation is quite large (11 GB), which doesn’t make a lot of sense with a heap of 15 GB. But maybe I’m misreading something. Matei

Re: Shark Vs Spark SQL

2014-07-02 Thread Matei Zaharia
Spark SQL in Spark 1.1 will include all the functionality in Shark; take a look at http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html. We decided to do this because at the end of the day, the only code left in Shark was the JDBC / Thrift

Re: AWS Credentials for private S3 reads

2014-07-02 Thread Matei Zaharia
When you use hadoopConfiguration directly, I don’t think you have to replace the “/” with “%2f”. Have you tried it without that? Also make sure you’re not replacing slashes in the URL itself. Matei On Jul 2, 2014, at 4:17 PM, Brian Gawalt bgaw...@gmail.com wrote: Hello everyone, I'm
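
A sketch of that approach: the credentials go on hadoopConfiguration instead of into the URL, so a secret key containing "/" needs no "%2f" escaping (bucket name and variable names are hypothetical):

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
    val logs = sc.textFile("s3n://my-private-bucket/path/to/logs")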

Re: the Pre-built packages for CDH4 can not support yarn ?

2014-07-07 Thread Matei Zaharia
They are for CDH4 without YARN, since YARN is experimental in that. You can download one of the Hadoop 2 packages if you want to run on YARN. Or you might have to build specifically against CDH4's version of YARN if that doesn't work. Matei On Jul 7, 2014, at 9:37 PM, ch huang

Re: Document page load fault

2014-07-08 Thread Matei Zaharia
Thanks for catching this. For now you can just access the page through http:// instead of https:// to avoid this. Matei On Jul 8, 2014, at 10:46 PM, binbinbin915 binbinbin...@live.cn wrote: https://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression on Chrome 35 with

Re: spark ui on yarn

2014-07-12 Thread Matei Zaharia
The UI code is the same in both, but one possibility is that your executors were given less memory on YARN. Can you check that? Or otherwise, how do you know that some RDDs were cached? Matei On Jul 12, 2014, at 4:12 PM, Koert Kuipers ko...@tresata.com wrote: hey shuo, so far all stage

Re: Can we get a spark context inside a mapper

2014-07-14 Thread Matei Zaharia
You currently can't use SparkContext inside a Spark task, so in this case you'd have to call some kind of local K-means library. One example you can try to use is Weka (http://www.cs.waikato.ac.nz/ml/weka/). You can then load your text files as an RDD of strings with SparkContext.wholeTextFiles

Re: Ideal core count within a single JVM

2014-07-14 Thread Matei Zaharia
Are you increasing the number of parallel tasks with cores as well? With more tasks there will be more data communicated and hence more calls to these functions. Unfortunately contention is kind of hard to measure, since often the result is that you see many cores idle as they're waiting on a

Re: Memory compute-intensive tasks

2014-07-14 Thread Matei Zaharia
I think coalesce with shuffle=true will force it to have one task per node. Without that, it might be that due to data locality it decides to launch multiple ones on the same node even though the total # of tasks is equal to the # of nodes. If this is the *only* thing you run on the cluster,
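
As a sketch, with a hypothetical node count and processing function:

    val numNodes = 16                                       // assumed cluster size
    val spread = data.coalesce(numNodes, shuffle = true)    // shuffle forces an even redistribution
    val results = spread.mapPartitions(iter => iter.map(runExpensiveStep))   // runExpensiveStep is hypothetical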

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Matei Zaharia
Yeah, I'd just add a spark-util that has these things. Matei On Jul 14, 2014, at 1:04 PM, Michael Armbrust mich...@databricks.com wrote: Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus

Re: Spark 1.0.1 EC2 - Launching Applications

2014-07-14 Thread Matei Zaharia
The script should be there, in the spark/bin directory. What command did you use to launch the cluster? Matei On Jul 14, 2014, at 1:12 PM, Josh Happoldt josh.happo...@trueffect.com wrote: Hi All, I've used the spark-ec2 scripts to build a simple 1.0.1 Standalone cluster on EC2. It

Re: Ideal core count within a single JVM

2014-07-14 Thread Matei Zaharia
of them here, but if your file is big it will also have at least one task per 32 MB block of the file. Matei On Jul 14, 2014, at 6:39 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I see, so here might be the problem. With more cores, there's less memory available per core, and now many of your

Re: hdfs replication on saving RDD

2014-07-14 Thread Matei Zaharia
You can change this setting through SparkContext.hadoopConfiguration, or put the conf/ directory of your Hadoop installation on the CLASSPATH when you launch your app so that it reads the config values from there. Matei On Jul 14, 2014, at 8:06 PM, valgrind_girl 124411...@qq.com wrote: eager
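
For example, to lower the replication factor for files written by this application (a sketch; dfs.replication is the standard HDFS key, and the output path is hypothetical):

    sc.hadoopConfiguration.set("dfs.replication", "1")   // applies to output written through this context
    rdd.saveAsTextFile("hdfs:///output/path")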

Re: Iteration question

2014-07-15 Thread Matei Zaharia
Hi Nathan, I think there are two possible reasons for this. One is that even though you are caching RDDs, their lineage chain gets longer and longer, and thus serializing each RDD takes more time. You can cut off the chain by using RDD.checkpoint() periodically, say every 5-10 iterations. The
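
A sketch of that checkpointing pattern; step(), initialRdd, numIterations, and the checkpoint directory are hypothetical placeholders:

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")
    var current = initialRdd
    for (i <- 1 to numIterations) {
      current = step(current).cache()     // step() is the per-iteration transformation (hypothetical)
      if (i % 5 == 0) {
        current.checkpoint()              // cut off the lineage chain every few iterations
        current.count()                   // force evaluation so the checkpoint is actually written
      }
    }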

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Matei Zaharia
Yeah, this is handled by the commit call of the FileOutputFormat. In general Hadoop OutputFormats have a concept called committing the output, which you should do only once per partition. In the file ones it does an atomic rename to make sure that the final output is a complete file. Matei On

Re: Release date for new pyspark

2014-07-16 Thread Matei Zaharia
Yeah, we try to have a regular 3 month release cycle; see https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage for the current window. Matei On Jul 16, 2014, at 4:21 PM, Mark Hamstra m...@clearstorydata.com wrote: You should expect master to compile and run: patches aren't merged

Re: preservesPartitioning

2014-07-17 Thread Matei Zaharia
Hi Kamal, This is not what preservesPartitioning does -- actually what it means is that if the RDD has a Partitioner set (which means it's an RDD of key-value pairs and the keys are grouped into a known way, e.g. hashed or range-partitioned), your map function is not changing the partition of

Re: Spark scheduling with Capacity scheduler

2014-07-17 Thread Matei Zaharia
It's possible using the --queue argument of spark-submit. Unfortunately this is not documented on http://spark.apache.org/docs/latest/running-on-yarn.html but it appears if you just type spark-submit --help or spark-submit with no arguments. Matei On Jul 17, 2014, at 2:33 AM, Konstantin

Re: Include permalinks in mail footer

2014-07-17 Thread Matei Zaharia
Good question.. I'll ask INFRA because I haven't seen other Apache mailing lists provide this. It would indeed be helpful. Matei On Jul 17, 2014, at 12:59 PM, Nick Chammas nicholas.cham...@gmail.com wrote: Can we modify the mailing list to include permalinks to the thread in the footer of

Re: Launching with m3.2xlarge instances: /mnt and /mnt2 mounted on 7gb drive

2014-07-20 Thread Matei Zaharia
Is this with the 1.0.0 scripts? I believe it's fixed in 1.0.1. Matei On Jul 20, 2014, at 1:22 AM, Chris DuBois chris.dub...@gmail.com wrote: Using the spark-ec2 script with m3.2xlarge instances seems to not have /mnt and /mnt2 pointing to the 80gb SSDs that come with that instance. Does

Re: Launching with m3.2xlarge instances: /mnt and /mnt2 mounted on 7gb drive

2014-07-21 Thread Matei Zaharia
Actually the script in the master branch is also broken (it's pointing to an older AMI). Try 1.0.1 for launching clusters. On Jul 20, 2014, at 2:25 PM, Chris DuBois chris.dub...@gmail.com wrote: I pulled the latest last night. I'm on commit 4da01e3. On Sun, Jul 20, 2014 at 2:08 PM, Matei

Re: Very wierd behavior

2014-07-22 Thread Matei Zaharia
Is the first() being computed locally on the driver program? Maybe it's too hard to compute with the memory, etc. available there. Take a look at the driver's log and see whether it has the message Computing the requested partition locally. Matei On Jul 22, 2014, at 12:04 PM, Nathan Kronenfeld

Re: akka 2.3.x?

2014-07-24 Thread Matei Zaharia
This is being tracked here: https://issues.apache.org/jira/browse/SPARK-1812, since it will also be needed for cross-building with Scala 2.11. Maybe we can do it before that. Probably too late for 1.1, but you should open an issue for 1.2. In that JIRA I linked, there's a pull request from a

Re: mapToPair vs flatMapToPair vs flatMap function usage.

2014-07-24 Thread Matei Zaharia
The Pair ones return a JavaPairRDD, which has additional operations on key-value pairs. Take a look at http://spark.apache.org/docs/latest/programming-guide.html#working-with-key-value-pairs for details. Matei On Jul 24, 2014, at 3:41 PM, abhiguruvayya sharath.abhis...@gmail.com wrote: Can

Re: Spilling in-memory... messages in log even with MEMORY_ONLY

2014-07-26 Thread Matei Zaharia
These messages are actually not about spilling the RDD, they're about spilling intermediate state in a reduceByKey, groupBy or other operation whose state doesn't fit in memory. We have to do that in these cases to avoid going out of memory. You can minimize spilling by having more reduce tasks

Re: Spilling in-memory... messages in log even with MEMORY_ONLY

2014-07-26 Thread Matei Zaharia
Even in local mode, Spark serializes data that would be sent across the network, e.g. in a reduce operation, so that you can catch errors that would happen in distributed mode. You can make serialization much faster by using the Kryo serializer; see
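
Enabling Kryo is a two-line configuration change; the registrator class below is a hypothetical placeholder for registering your own classes:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")   // optional, hypothetical class
    val sc = new org.apache.spark.SparkContext(conf)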

Re: Spark MLlib vs BIDMach Benchmark

2014-07-27 Thread Matei Zaharia
These numbers are from GPUs and Intel MKL (a closed-source math library for Intel processors), where for CPU-bound algorithms you are going to get faster speeds than MLlib's JBLAS. However, there's in theory nothing preventing the use of these in MLlib (e.g. if you have a faster BLAS locally;

Re: Job using Spark for Machine Learning

2014-07-29 Thread Matei Zaharia
Hi Martin, Job ads are actually not allowed on the list, but thanks for asking. Just posting this for others' future reference. Matei On July 29, 2014 at 8:34:59 AM, Martin Goodson (mar...@skimlinks.com) wrote: I'm not sure if job adverts are allowed on here - please let me know if not. 

Re: Do I need to know Scala to take full advantage of spark?

2014-07-30 Thread Matei Zaharia
Java is very close to Scala across the board, the only thing missing in it right now is GraphX (which is still alpha). Python is missing GraphX, streaming and a few of the ML algorithms, though most of them are there. So it should be fine to start with  any of them. See 

Re: correct upgrade process

2014-08-01 Thread Matei Zaharia
This should be okay, but make sure that your cluster also has the right code deployed. Maybe you have the wrong one. If you built Spark from source multiple times, you may also want to try sbt clean before sbt assembly. Matei On August 1, 2014 at 12:00:07 PM, SK (skrishna...@gmail.com) wrote:

Re: Spark Training Course?

2014-08-04 Thread Matei Zaharia
This looks pretty comprehensive to me. A few quick suggestions: - On the VM part: we've actually been avoiding this in all the Databricks training efforts because the VM itself can be annoying to install and it makes it harder for people to really use Spark for development (they can learn it,

Re: Create a new object by given classtag

2014-08-04 Thread Matei Zaharia
To get the ClassTag object inside your function with the original syntax you used (T: ClassTag), you can do this: def read[T: ClassTag](): T = {   val ct = classTag[T]   ct.runtimeClass.newInstance().asInstanceOf[T] } Passing the ClassTag with : ClassTag lets you have an implicit parameter that
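
The same snippet reflowed into compilable form, with a small hypothetical usage example appended:

    import scala.reflect._

    // The context bound `T: ClassTag` adds an implicit ClassTag parameter,
    // which classTag[T] retrieves inside the body.
    def read[T: ClassTag](): T = {
      val ct = classTag[T]
      ct.runtimeClass.newInstance().asInstanceOf[T]
    }

    // Works for any class with a no-arg constructor:
    class Widget { override def toString = "Widget()" }
    val w = read[Widget]()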

Re: Include permalinks in mail footer

2014-08-05 Thread Matei Zaharia
Emails sent from Nabble have it, while others don't. Unfortunately I haven't received a reply from ASF infra on this yet. Matei On August 5, 2014 at 2:04:10 PM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote: Looks like this feature has been turned off. Are these changes intentional? Or

Re: Include permalinks in mail footer

2014-08-05 Thread Matei Zaharia
Oh actually sorry, it looks like infra has looked at it but they can't add permalinks. They can only add here's how to unsubscribe footers. My bad, I just didn't catch the email update from them. Matei On August 5, 2014 at 2:39:45 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote: Emails sent

Re: No space left on device

2014-08-09 Thread Matei Zaharia
Your map-only job should not be shuffling, but if you want to see what's running, look at the web UI at http://driver:4040. In fact the job should not even write stuff to disk except inasmuch as the Hadoop S3 library might build up blocks locally before sending them on. My guess is that it's

Re: DistCP - Spark-based

2014-08-12 Thread Matei Zaharia
Good question; I don't know of one but I believe people at Cloudera had some thoughts of porting Sqoop to Spark in the future, and maybe they'd consider DistCP as part of this effort. I agree it's missing right now. Matei On August 12, 2014 at 11:04:28 AM, Gary Malouf (malouf.g...@gmail.com)

Re: Lost executors

2014-08-13 Thread Matei Zaharia
What is your Spark executor memory set to? (You can see it in Spark's web UI at http://driver:4040 under the executors tab). One thing to be aware of is that the JVM never really releases memory back to the OS, so it will keep filling up to the maximum heap size you set. Maybe 4 executors with

Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-16 Thread Matei Zaharia
Thanks for sharing this, Brandon! Looks like a great architecture for people to build on. Matei On August 15, 2014 at 2:07:06 PM, Brandon Amos (a...@adobe.com) wrote: Hi Spark community, At Adobe Research, we're happy to open source a prototype technology called Spindle we've been

Re: why classTag not typeTag?

2014-08-22 Thread Matei Zaharia
TypeTags are unfortunately not thread-safe in Scala 2.10. They were still somewhat experimental at the time so we decided not to use them. If you want though, you can probably design other APIs that pass a TypeTag around (e.g. make a method that takes an RDD[T] but also requires an implicit

Re: Installation On Windows machine

2014-08-22 Thread Matei Zaharia
You should be able to just download / unzip a Spark release and run it on a Windows machine with the provided .cmd scripts, such as bin\spark-shell.cmd. The scripts to launch a standalone cluster (e.g. start-all.sh) won't work on Windows, but you can launch a standalone cluster manually using

Re: spark and matlab

2014-08-25 Thread Matei Zaharia
Have you tried the pipe() operator? It should work if you can launch your script from the command line. Just watch out for any environment variables needed (you can pass them to pipe() as an optional argument if there are some). On August 25, 2014 at 12:41:29 AM, Jaonary Rabarisoa
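
A sketch of the pipe() call with the optional environment map mentioned above; the script path, paths, and variable names are hypothetical:

    val input = sc.textFile("hdfs:///data/features.txt")
    val results = input.pipe(
      Seq("/path/to/run_matlab.sh"),           // external script invoked for each partition's records
      Map("MATLAB_HOME" -> "/opt/matlab")      // env vars the child process needs
    )
    results.saveAsTextFile("hdfs:///data/results")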

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-25 Thread Matei Zaharia
It seems to be because you went there with https:// instead of http://. That said, we'll fix it so that it works on both protocols. Matei On August 25, 2014 at 1:56:16 PM, Nick Chammas (nicholas.cham...@gmail.com) wrote: https://spark.apache.org/screencasts/1-first-steps-with-spark.html The

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-26 Thread Matei Zaharia
It should be fixed now. Maybe you have a cached version of the page in your browser. Open DevTools (cmd-shift-I), press the gear icon, and check "Disable cache (while DevTools is open)", then refresh the page to reload without cache. Matei On August 26, 2014 at 7:31:18 AM, Nicholas Chammas

Re: Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Matei Zaharia
Is this a standalone mode cluster? We don't currently make this guarantee, though it will likely work in 1.0.0 to 1.0.2. The problem though is that the standalone mode grabs the executors' version of Spark code from what's installed on the cluster, while your driver might be built against

Re: Parsing Json object definition spanning multiple lines

2014-08-26 Thread Matei Zaharia
You can use sc.wholeTextFiles to read each file as a complete String, though it requires each file to be small enough for one task to process. On August 26, 2014 at 4:01:45 PM, Chris Fregly (ch...@fregly.com) wrote: i've seen this done using mapPartitions() where each partition represents a
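
A sketch of that approach, assuming each file in the directory holds one multi-line JSON object; the JSON parser bundled with Scala 2.10 is used here, but any JSON library would do:

    import scala.util.parsing.json.JSON

    val objects = sc.wholeTextFiles("hdfs:///data/json-dir")    // RDD[(filePath, fileContents)]
      .map { case (_, content) => JSON.parseFull(content) }     // parse each file as a whole
      .collect { case Some(parsed) => parsed }                  // keep only files that parsed cleanly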

Re: CUDA in spark, especially in MLlib?

2014-08-26 Thread Matei Zaharia
You should try to find a Java-based library, then you can call it from Scala. Matei On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote: Hi I am trying to find a CUDA library in Scala, to see if some matrix manipulation in MLlib can be sped up. I googled a few but found no

Re: Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Matei Zaharia
connect to an existing 1.0.0 cluster and see what happens... Thanks, Matei :) On Tue, Aug 26, 2014 at 6:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Is this a standalone mode cluster? We don't currently make this guarantee, though it will likely work in 1.0.0 to 1.0.2. The problem

Re: SchemaRDD

2014-08-27 Thread Matei Zaharia
I think this will increasingly be its role, though it doesn't make sense to tie it to core because it is clearly just a client of the core APIs. What usage do you have in mind in particular? It would be nice to know how the non-SQL APIs for this could be better. Matei On August 27, 2014 at

RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2014-08-27 Thread Matei Zaharia
You can use spark-shell -i file.scala to run that. However, that keeps the interpreter open at the end, so you need to make your file end with System.exit(0) (or even more robustly, do stuff in a try {} and add that in finally {}). In general it would be better to compile apps and run them
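
A sketch of such a script (contents hypothetical), saved as file.scala and run with spark-shell -i file.scala; sc is predefined in the shell, and the try/finally keeps the interpreter from staying open even when the job fails:

    try {
      val counts = sc.textFile("hdfs:///data/input")
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs:///data/word-counts")
    } finally {
      System.exit(0)   // close the shell once the script finishes (or fails)
    }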

Re: Update on Pig on Spark initiative

2014-08-27 Thread Matei Zaharia
Awesome to hear this, Mayur! Thanks for putting this together. Matei On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.rust...@gmail.com) wrote: Hi, We have migrated Pig functionality on top of Spark passing 100% e2e for success cases in pig test suite. That means UDF, Joins other

Re: Possible to make one executor be able to work on multiple tasks simultaneously?

2014-08-29 Thread Matei Zaharia
Yes, executors run one task per core of your machine by default. You can also manually launch them with more worker threads than you have cores. What cluster manager are you on? Matei On August 29, 2014 at 11:24:33 AM, Victor Tso-Guillen (v...@paxata.com) wrote: I'm thinking of local mode

Re: Mapping Hadoop Reduce to Spark

2014-08-30 Thread Matei Zaharia
In 1.1, you'll be able to get all of these properties using sortByKey, and then mapPartitions on top to iterate through the key-value pairs. Unfortunately sortByKey does not let you control the Partitioner, but it's fairly easy to write your own version that does if this is important. In
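
A sketch of that pattern for an RDD[(String, Int)]: sort by key, then walk each partition in order and emit one record per run of equal keys, mirroring a Hadoop reducer's sorted iteration (input RDD is hypothetical):

    import org.apache.spark.SparkContext._   // pair-RDD functions such as sortByKey in Spark 1.x

    val sorted = pairs.sortByKey()
    val perKey = sorted.mapPartitions { iter =>
      val buffered = iter.buffered
      new Iterator[(String, Seq[Int])] {
        def hasNext = buffered.hasNext
        def next() = {
          val key = buffered.head._1
          val values = scala.collection.mutable.ArrayBuffer[Int]()
          // Consume the whole run of identical keys, which sortByKey keeps in one partition.
          while (buffered.hasNext && buffered.head._1 == key) values += buffered.next()._2
          (key, values)
        }
      }
    }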

Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Matei Zaharia
. does it apply to both sides of the join, or only one (while othe other side is streaming)? On Sat, Aug 30, 2014 at 1:30 PM, Matei Zaharia matei.zaha...@gmail.com wrote: In 1.1, you'll be able to get all of these properties using sortByKey, and then mapPartitions on top to iterate through the key

Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Matei Zaharia
, Steve Lewis (lordjoe2...@gmail.com) wrote: Is there a sample of how to do this - I see 1.1 is out but cannot find samples of mapPartitions A Java sample would be very useful  On Sat, Aug 30, 2014 at 10:30 AM, Matei Zaharia matei.zaha...@gmail.com wrote: In 1.1, you'll be able to get all

Re: pandas-like dataframe in spark

2014-09-04 Thread Matei Zaharia
Hi Mohit, This looks pretty interesting, but just a note on the implementation -- it might be worthwhile to try doing this on top of Spark SQL SchemaRDDs. The reason is that SchemaRDDs already have an efficient in-memory representation (columnar storage), and can be read from a variety of data

Re: Mapping Hadoop Reduce to Spark

2014-09-04 Thread Matei Zaharia
and what do I expect to see in switching from one partition to another as the code runs? On Sat, Aug 30, 2014 at 10:30 AM, Matei Zaharia matei.zaha...@gmail.com wrote: In 1.1, you'll be able to get all of these properties using sortByKey, and then mapPartitions on top to iterate through the key-value

Re: Mapping Hadoop Reduce to Spark

2014-09-04 Thread Matei Zaharia
BTW you can also use rdd.partitions() to get a list of Partition objects and see how many there are. On September 4, 2014 at 5:18:30 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote: Partitioners also work in local mode, the only question is how to see which data fell into each partition

Re: Announcing Spark 1.1.0!

2014-09-11 Thread Matei Zaharia
Thanks to everyone who contributed to implementing and testing this release! Matei On September 11, 2014 at 11:52:43 PM, Tim Smith (secs...@gmail.com) wrote: Thanks for all the good work. Very excited about seeing more features and better stability in the framework. On Thu, Sep 11, 2014 at

Re: NullWritable not serializable

2014-09-12 Thread Matei Zaharia
Hi Du, I don't think NullWritable has ever been serializable, so you must be doing something differently from your previous program. In this case though, just use a map() to turn your Writables to serializable types (e.g. null and String). Matei On September 12, 2014 at 8:48:36 PM, Du Li
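
A sketch of that map(), converting Hadoop Writables to plain types right after reading a SequenceFile and before anything that ships data across the network (paths are hypothetical):

    import org.apache.hadoop.io.{NullWritable, Text}

    val raw = sc.sequenceFile("hdfs:///data/events.seq", classOf[NullWritable], classOf[Text])
    val usable = raw.map { case (_, text) => text.toString }   // (NullWritable, Text) -> String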

Re: compiling spark source code

2014-09-14 Thread Matei Zaharia
I've seen the file name too long error when compiling on an encrypted Linux file system -- some of them have a limit on file name lengths. If you're on Linux, can you try compiling inside /tmp instead? Matei On September 13, 2014 at 10:03:14 PM, Yin Huai (huaiyin@gmail.com) wrote: Can you

Re: scala 2.11?

2014-09-15 Thread Matei Zaharia
Scala 2.11 work is under way in open pull requests though, so hopefully it will be in soon. Matei On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) wrote: ah...thanks! On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra m...@clearstorydata.com wrote: No, not yet.  Spark SQL is

Re: scala 2.11?

2014-09-15 Thread Matei Zaharia
at the earliest. On Mon, Sep 15, 2014 at 12:11 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Scala 2.11 work is under way in open pull requests though, so hopefully it will be in soon. Matei On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) wrote: ah...thanks! On Mon, Sep 15

Re: Does Spark always wait for stragglers to finish running?

2014-09-15 Thread Matei Zaharia
It's true that it does not send a kill command right now -- we should probably add that. This code was written before tasks were killable AFAIK. However, the *job* should still finish while a speculative task is running as far as I know, and it will just leave that task behind. Matei On

Re: NullWritable not serializable

2014-09-15 Thread Matei Zaharia
.count(). As you can see, count() does not need to serialize and ship data while the other three methods do. Do you recall any difference between spark 1.0 and 1.1 that might cause this problem? Thanks, Du From: Matei Zaharia matei.zaha...@gmail.com Date: Friday, September 12, 2014 at 9:10 PM

Re: Complexity/Efficiency of SortByKey

2014-09-15 Thread Matei Zaharia
sortByKey is indeed O(n log n): it does a first pass to figure out even-sized partitions (by sampling the RDD), then a second pass to do a distributed merge-sort (first partitioning the data on each machine, then running a reduce phase that merges the data for each partition). The point where it becomes

Re: Spark as a Library

2014-09-16 Thread Matei Zaharia
If you want to run the computation on just one machine (using Spark's local mode), it can probably run in a container. Otherwise you can create a SparkContext there and connect it to a cluster outside. Note that I haven't tried this though, so the security policies of the container might be too

Re: Short Circuit Local Reads

2014-09-17 Thread Matei Zaharia
I'm pretty sure it does help, though I don't have any numbers for it. In any case, Spark will automatically benefit from this if you link it to a version of HDFS that contains this. Matei On September 17, 2014 at 5:15:47 AM, Gary Malouf (malouf.g...@gmail.com) wrote: Cloudera had a blog post

Re: paging through an RDD that's too large to collect() all at once

2014-09-18 Thread Matei Zaharia
Hey Dave, try out RDD.toLocalIterator -- it gives you an iterator that reads one RDD partition at a time. Scala iterators also have methods like grouped() that let you get fixed-size groups. Matei On September 18, 2014 at 7:58:34 PM, dave-anderson (david.ander...@pobox.com) wrote: I have an
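
A sketch of that paging pattern, with a hypothetical page size and per-page handler:

    val pageSize = 1000
    rdd.toLocalIterator               // pulls one RDD partition to the driver at a time
      .grouped(pageSize)              // batch the stream into fixed-size pages
      .foreach(page => handlePage(page))   // handlePage is a hypothetical callback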

Re: Possibly a dumb question: differences between saveAsNewAPIHadoopFile and saveAsNewAPIHadoopDataset?

2014-09-22 Thread Matei Zaharia
File takes a filename to write to, while Dataset takes only a JobConf. This means that Dataset is more general (it can also save to storage systems that are not file systems, such as key-value stores), but is more annoying to use if you actually have a file. Matei On September 21, 2014 at

Re: Spark Code to read RCFiles

2014-09-23 Thread Matei Zaharia
Is your file managed by Hive (and thus present in a Hive metastore)? In that case, Spark SQL (https://spark.apache.org/docs/latest/sql-programming-guide.html) is the easiest way. Matei On September 23, 2014 at 2:26:10 PM, Pramod Biligiri (pramodbilig...@gmail.com) wrote: Hi, I'm trying to

Re: run scalding on spark

2014-10-01 Thread Matei Zaharia
Pretty cool, thanks for sharing this! I've added a link to it on the wiki: https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects. Matei On Oct 1, 2014, at 1:41 PM, Koert Kuipers ko...@tresata.com wrote: well, sort of! we make input/output formats (cascading taps,

Re: Spark And Mapr

2014-10-01 Thread Matei Zaharia
It should just work in PySpark, the same way it does in Java / Scala apps. Matei On Oct 1, 2014, at 4:12 PM, Sungwook Yoon sy...@maprtech.com wrote: Yes.. you should use maprfs:// I personally haven't used pyspark, I just used scala shell or standalone with MapR. I think you need to

Re: Multiple spark shell sessions

2014-10-01 Thread Matei Zaharia
You need to set --total-executor-cores to limit how many total cores it grabs on the cluster. --executor-cores is just for each individual executor, but it will try to launch many of them. Matei On Oct 1, 2014, at 4:29 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID wrote: hey

Re: Spark SQL -- more than two tables for join

2014-10-07 Thread Matei Zaharia
The issue is that you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how you'd write this in HiveQL, and then try doing that with HiveContext.
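
A sketch of the HiveContext route, assuming Spark 1.1+ (where HiveContext.sql parses HiveQL by default); table and column names are hypothetical:

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.sql(
      """SELECT o.id, c.name, t.total
        |FROM orders o
        |JOIN customers c ON o.customer_id = c.id
        |JOIN totals t ON o.id = t.order_id""".stripMargin
    ).collect().foreach(println)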

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Matei Zaharia
I'm pretty sure inner joins on Spark SQL already build only one of the sides. Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer joins do both, and it seems like we could optimize it for those that are not full. Matei On Oct 7, 2014, at 11:04 PM, Haopu Wang

Re: Convert a org.apache.spark.sql.SchemaRDD[Row] to a RDD of Strings

2014-10-09 Thread Matei Zaharia
A SchemaRDD is still an RDD, so you can just do rdd.map(row => row.toString). Or if you want to get a particular field of the row, you can do rdd.map(row => row(3).toString). Matei On Oct 9, 2014, at 1:22 PM, Soumya Simanta soumya.sima...@gmail.com wrote: I've a SchemaRDD that I want to

Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Matei Zaharia
Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which is that we've been able to use Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x fewer nodes. There's a detailed writeup at

Re: add Boulder-Denver Spark meetup to list on website

2014-10-10 Thread Matei Zaharia
Added you, thanks! (You may have to shift-refresh the page to see it updated). Matei On Oct 10, 2014, at 1:52 PM, Michael Oczkowski michael.oczkow...@seeq.com wrote: Please add the Boulder-Denver Spark meetup group to the list on the website.

Re: Blog post: An Absolutely Unofficial Way to Connect Tableau to SparkSQL (Spark 1.1)

2014-10-11 Thread Matei Zaharia
Very cool Denny, thanks for sharing this! Matei On Oct 11, 2014, at 9:46 AM, Denny Lee denny.g@gmail.com wrote: https://www.concur.com/blog/en-us/connect-tableau-to-sparksql If you're wondering how to connect Tableau to SparkSQL - here are the steps to connect Tableau to SparkSQL.
