Re: Why Apache Spark doesn't use Calcite?

2020-01-13 Thread Matei Zaharia
I’m pretty sure that Catalyst was built before Calcite, or at least in parallel. Calcite 1.0 was only released in 2015. From a technical standpoint, building Catalyst in Scala also made it more concise and easier to extend than an optimizer written in Java (you can find various presentations

Re: Spark 2.4.0 artifact in Maven repository

2018-11-06 Thread Matei Zaharia
Hi Bartosz, This is because the vote on 2.4 has passed (you can see the vote thread on the dev mailing list) and we are just working to get the release into various channels (Maven, PyPI, etc), which can take some time. Expect to see an announcement soon once that’s done. Matei > On Nov 4,

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Matei Zaharia
s prefer to get that notification sooner > rather than later? > > On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia > wrote: > I’d like to understand the maintenance burden of Python 2 before deprecating > it. Since it is not EOL yet, it might make sense to only deprecate it once > it’s

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Matei Zaharia
I’d like to understand the maintenance burden of Python 2 before deprecating it. Since it is not EOL yet, it might make sense to only deprecate it once it’s EOL (which is still over a year from now). Supporting Python 2+3 seems less burdensome than supporting, say, multiple Scala versions in

Re: Is there any open source framework that converts Cypher to SparkSQL?

2018-09-16 Thread Matei Zaharia
GraphFrames (https://graphframes.github.io) offers a Cypher-like syntax that then executes on Spark SQL. > On Sep 14, 2018, at 2:42 AM, kant kodali wrote: > > Hi All, > > Is there any open source framework that converts Cypher to SparkSQL? > > Thanks!
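
For illustration, a minimal sketch of GraphFrames' Cypher-like motif-finding syntax; the GraphFrame `g` and the attribute filter are assumptions, not from the thread:

    // Find pairs of vertices with edges in both directions, then filter
    // on a vertex attribute. `g` is an existing GraphFrame.
    val motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
    motifs.filter("b.age > 30").show()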

Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Matei Zaharia
Maybe your application is overriding the master variable when it creates its SparkContext. I see you are still passing “yarn-client” as an argument later to it in your command. > On Jun 17, 2018, at 11:53 AM, Raymond Xie wrote: > > Thank you Subhash. > > Here is the new command: >
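
A minimal sketch of a SparkContext setup that leaves the master to the launch command (the app name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Don't call conf.setMaster(...) here if you want the --master flag
    // on the launch command to take effect.
    val conf = new SparkConf().setAppName("MyApp")
    val sc = new SparkContext(conf)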

Re: Spark 1.x - End of life

2017-10-19 Thread Matei Zaharia
Hi Ismael, It depends on what you mean by “support”. In general, there won’t be new feature releases for 1.X (e.g. Spark 1.7) because all the new features are being added to the master branch. However, there is always room for bug fix releases if there is a catastrophic bug, and committers can

Re: Kill Spark Streaming JOB from Spark UI or Yarn

2017-08-27 Thread Matei Zaharia
The batches should all have the same application ID, so use that one. You can also find the application in the YARN UI to terminate it from there. Matei > On Aug 27, 2017, at 10:27 AM, KhajaAsmath Mohammed > wrote: > > Hi, > > I am new to spark streaming and not

Re: real world spark code

2017-07-25 Thread Matei Zaharia
You can also find a lot of GitHub repos for external packages here: http://spark.apache.org/third-party-projects.html Matei > On Jul 25, 2017, at 5:30 PM, Frank Austin Nothaft > wrote: > > There’s a number of real-world open source Spark applications in the sciences: >

Re: Structured Streaming with Kafka Source, does it work??

2016-11-06 Thread Matei Zaharia
The Kafka source will only appear in 2.0.2 -- see this thread for the current release candidate: https://lists.apache.org/thread.html/597d630135e9eb3ede54bb0cc0b61a2b57b189588f269a64b58c9243@%3Cdev.spark.apache.org%3E . You can try that right now if you want from the staging Maven repo shown

Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Matei Zaharia
I think people explained this pretty well, but in practice, this distinction is also somewhat of a marketing term, because every system will perform some kind of batching. For example, every time you use TCP, the OS and network stack may buffer multiple messages together and send them at once;

Re: unsubscribe

2016-08-10 Thread Matei Zaharia
To unsubscribe, please send an email to user-unsubscr...@spark.apache.org from the address you're subscribed from. Matei > On Aug 10, 2016, at 12:48 PM, Sohil Jain wrote: > > - To unsubscribe

Re: Dropping late data in Structured Streaming

2016-08-06 Thread Matei Zaharia
Yes, a built-in mechanism is planned in future releases. You can also drop it using a filter for now but the stateful operators will still keep state for old windows. Matei > On Aug 6, 2016, at 9:40 AM, Amit Sela wrote: > > I've noticed that when using Structured
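
A sketch of the interim filter suggested above, assuming an input DataFrame `events` with a hypothetical `eventTime` column:

    // Drop records more than 10 minutes old; the lateness bound is illustrative.
    val recent = events.where("eventTime > current_timestamp() - INTERVAL 10 MINUTES")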

Re: The Future Of DStream

2016-07-27 Thread Matei Zaharia
Yup, they will definitely coexist. Structured Streaming is currently alpha and will probably be complete in the next few releases, but Spark Streaming will continue to exist, because it gives the user more low-level control. It's similar to DataFrames vs RDDs (RDDs are the lower-level API for

Updated Spark logo

2016-06-10 Thread Matei Zaharia
Hi all, FYI, we've recently updated the Spark logo at https://spark.apache.org/ to say "Apache Spark" instead of just "Spark". Many ASF projects have been doing this recently to make it clearer that they are associated with the ASF, and indeed the ASF's branding guidelines generally require

Re: Apache Spark Slack

2016-05-16 Thread Matei Zaharia
I don't think any of the developers use this as an official channel, but all the ASF IRC channels are indeed on FreeNode. If there's demand for it, we can document this on the website and say that it's mostly for users to find other users. Development discussions should happen on the dev

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Matei Zaharia
This sounds good to me as well. The one thing we should pay attention to is how we update the docs so that people know to start with the spark.ml classes. Right now the docs list spark.mllib first and also seem more comprehensive in that area than in spark.ml, so maybe people naturally move

Re: simultaneous actions

2016-01-17 Thread Matei Zaharia
able to dispatch jobs from both actions simultaneously (or on a > when-workers-become-available basis)? > > On 15 January 2016 at 11:44, Koert Kuipers <ko...@tresata.com> wrote: > we run multiple actions on the same (cached) rdd a

Re: simultaneous actions

2016-01-15 Thread Matei Zaharia
RDDs actually are thread-safe, and quite a few applications use them this way, e.g. the JDBC server. Matei > On Jan 15, 2016, at 2:10 PM, Jakob Odersky wrote: > > I don't think RDDs are threadsafe. > More fundamentally however, why would you want to run RDD actions in >
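
A sketch of running two actions concurrently from separate threads against one cached RDD (the path is a placeholder):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Two jobs submitted from separate threads against one cached RDD.
    val rdd = sc.textFile("hdfs:///data/events").cache()
    val totalF  = Future { rdd.count() }
    val errorsF = Future { rdd.filter(_.contains("ERROR")).count() }
    val (total, errors) =
      (Await.result(totalF, Duration.Inf), Await.result(errorsF, Duration.Inf))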

Re: Compiling only MLlib?

2016-01-15 Thread Matei Zaharia
Have you tried just downloading a pre-built package, or linking to Spark through Maven? You don't need to build it unless you are changing code inside it. Check out http://spark.apache.org/docs/latest/quick-start.html#self-contained-applications for how to link to it. Matei > On Jan 15,
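
For instance, linking against Spark from sbt takes one dependency stanza; the version shown is illustrative for that era:

    // build.sbt
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "1.6.0",
      "org.apache.spark" %% "spark-mllib" % "1.6.0" // only if you use MLlib
    )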

Re: Read from AWS s3 with out having to hard-code sensitive keys

2016-01-11 Thread Matei Zaharia
In production, I'd recommend using IAM roles to avoid having keys altogether. Take a look at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html. Matei > On Jan 11, 2016, at 11:32 AM, Sabarish Sasidharan > wrote: > > If you are

Re: How to compile Spark with customized Hadoop?

2015-10-09 Thread Matei Zaharia
You can publish your version of Hadoop to your Maven cache with mvn publish (just give it a different version number, e.g. 2.7.0a) and then pass that as the Hadoop version to Spark's build (see http://spark.apache.org/docs/latest/building-spark.html

Re: Ranger-like Security on Spark

2015-09-03 Thread Matei Zaharia
If you run on YARN, you can use Kerberos, be authenticated as the right user, etc in the same way as MapReduce jobs. Matei > On Sep 3, 2015, at 1:37 PM, Daniel Schulz > wrote: > > Hi, > > I really enjoy using Spark. An obstacle to sell it to our clients

Re: Ranger-like Security on Spark

2015-09-03 Thread Matei Zaharia
entitled to read/write? Will > it enforce HDFS ACLs and Ranger policies as well? > > Best regards, Daniel. > > > On 03 Sep 2015, at 21:16, Matei Zaharia <matei.zaha...@gmail.com> wrote: > > If you ru

Re: work around Size exceeds Integer.MAX_VALUE

2015-07-09 Thread Matei Zaharia
This means that one of your cached RDD partitions is bigger than 2 GB of data. You can fix it by having more partitions. If you read data from a file system like HDFS or S3, set the number of partitions higher in the sc.textFile, hadoopFile, etc methods (it's an optional second parameter to
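
A sketch of both suggestions; the path, the partition counts, and `existingRdd` are placeholders:

    // Ask for more partitions up front (minPartitions is the optional 2nd argument)...
    val lines = sc.textFile("hdfs:///data/big-input", 2000)
    // ...or split an existing RDD into more partitions before caching it.
    val finer = existingRdd.repartition(2000)
    finer.cache()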

Re: Spark or Storm

2015-06-17 Thread Matei Zaharia
This documentation is only for writes to an external system, but all the counting you do within your streaming app (e.g. if you use reduceByKeyAndWindow to keep track of a running count) is exactly-once. When you write to a storage system, no matter which streaming framework you use, you'll
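
A sketch of the in-app counting mentioned above, assuming an existing DStream[String] named `lines`:

    import org.apache.spark.streaming.Seconds

    // Running word counts over a 60s window, sliding every 10s; the counts
    // computed inside the streaming app are exactly-once.
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1L))
      .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))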

Re: Spark or Storm

2015-06-17 Thread Matei Zaharia
[4,5,6] can be invoked before the operation for offset [1,2,3] 2) If you wanted to achieve something similar to what TridentState does, you'll have to do it yourself (for example using Zookeeper) Is this a correct understanding? On Wed, Jun 17, 2015 at 7:14 PM, Matei Zaharia matei.zaha

Re: Equivalent to Storm's 'field grouping' in Spark.

2015-06-03 Thread Matei Zaharia
This happens automatically when you use the byKey operations, e.g. reduceByKey, updateStateByKey, etc. Spark Streaming keeps the state for a given set of keys on a specific node and sends new tuples with that key to that node. Matei On Jun 3, 2015, at 6:31 AM, allonsy luke1...@gmail.com wrote:
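
A sketch of the per-key state this gives you, assuming a DStream of (key, count) pairs named `pairs` and a checkpoint directory already set on the context:

    // Running count per key; state for each key lives on one node, and new
    // tuples with that key are routed to it.
    val updateFn = (newValues: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + newValues.sum)
    val runningCounts = pairs.updateStateByKey(updateFn)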

Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
:-UseCompressedOops SPARK_DRIVER_MEMORY=129G spark version: 1.1.1 Thank you a lot for your help! 2015-06-02 4:40 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com: As long as you don't use cache(), these operations will go from disk to disk, and will only use

Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
? Thank you! 2015-06-02 21:25 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com: You shouldn't have to persist the RDD at all, just call flatMap and reduce on it directly. If you try to persist it, that will try to load the original data into memory, but here

Re: Spark logo license

2015-05-19 Thread Matei Zaharia
Check out Apache's trademark guidelines here: http://www.apache.org/foundation/marks/ Matei On May 20, 2015, at 12:02 AM, Justin Pihony justin.pih...@gmail.com wrote: What is the license on using the Spark logo? Is it free to be used for displaying

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Matei Zaharia
Hey Tom, Are you using the fine-grained or coarse-grained scheduler? For the coarse-grained scheduler, there is a spark.cores.max config setting that will limit the total # of cores it grabs. This was there in earlier versions too. Matei On May 19, 2015, at 12:39 PM, Thomas Dudziak

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Matei Zaharia
of tasks per job :) cheers, Tom On Tue, May 19, 2015 at 10:05 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Tom, Are you using the fine-grained or coarse-grained scheduler? For the coarse-grained scheduler, there is a spark.cores.max config

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Matei Zaharia
...This is madness! On May 14, 2015, at 9:31 AM, dmoralesdf dmora...@stratio.com wrote: Hi there, We have released our real-time aggregation engine based on Spark Streaming. SPARKTA is fully open source (Apache2). You can check out the slides shown at Strata last week:

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Matei Zaharia
(Sorry, for non-English people: that means it's a good thing.) Matei On May 14, 2015, at 10:53 AM, Matei Zaharia matei.zaha...@gmail.com wrote: ...This is madness! On May 14, 2015, at 9:31 AM, dmoralesdf dmora...@stratio.com wrote: Hi there, We have released our real-time

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-12 Thread Matei Zaharia
It could also be that your hash function is expensive. What is the key class you have for the reduceByKey / groupByKey? Matei On May 12, 2015, at 10:08 AM, Night Wolf nightwolf...@gmail.com wrote: I'm seeing a similar thing with a slightly different stack trace. Ideas?
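
If the key really is expensive to hash (e.g. a long string or a deep structure), one workaround sketch is to cache the hash inside the key class; everything here is illustrative:

    // Wraps an expensive key and computes its hash exactly once.
    class CachedHashKey(val value: String) extends Serializable {
      override val hashCode: Int = value.hashCode
      override def equals(other: Any): Boolean = other match {
        case k: CachedHashKey => k.value == value
        case _ => false
      }
    }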

Re: Spark on Windows

2015-04-16 Thread Matei Zaharia
You could build Spark with Scala 2.11 on Mac / Linux and transfer it over to Windows. AFAIK it should build on Windows too, the only problem is that Maven might take a long time to download dependencies. What errors are you seeing? Matei On Apr 16, 2015, at 9:23 AM, Arun Lists

Re: Dataset announcement

2015-04-15 Thread Matei Zaharia
Very neat, Olivier; thanks for sharing this. Matei On Apr 15, 2015, at 5:58 PM, Olivier Chapelle oliv...@chapelle.cc wrote: Dear Spark users, I would like to draw your attention to a dataset that we recently released, which is as of now the largest machine learning dataset ever released;

Re: IPython notebook command for Spark needs to be updated?

2015-03-20 Thread Matei Zaharia
Feel free to send a pull request to fix the doc (or say which versions it's needed in). Matei On Mar 20, 2015, at 6:49 PM, Krishna Sankar ksanka...@gmail.com wrote: Yep the command-option is gone. No big deal, just add the '%pylab inline' command as part of your notebook. Cheers k/

Re: Querying JSON in Spark SQL

2015-03-16 Thread Matei Zaharia
The programming guide has a short example: http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets. Note that once you infer a schema for a JSON dataset, you can also use nested path notation
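
A sketch of that nested path notation against a hypothetical file and schema, using the 1.x-era API:

    val people = sqlContext.jsonFile("examples/people.json")
    people.registerTempTable("people")
    // Nested fields are reachable with dotted paths.
    val cities = sqlContext.sql("SELECT name, address.city FROM people")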

Re: Berlin Apache Spark Meetup

2015-02-17 Thread Matei Zaharia
Thanks! I've added you. Matei On Feb 17, 2015, at 4:06 PM, Ralph Bergmann | the4thFloor.eu ra...@the4thfloor.eu wrote: Hi, there is a small Spark Meetup group in Berlin, Germany :-) http://www.meetup.com/Berlin-Apache-Spark-Meetup/ Please add this group to the Meetups list at

Re: Beginner in Spark

2015-02-06 Thread Matei Zaharia
You don't need HDFS or virtual machines to run Spark. You can just download it, unzip it and run it on your laptop. See http://spark.apache.org/docs/latest/index.html. Matei On Feb 6, 2015, at 2:58 PM, David Fallside falls...@us.ibm.com wrote:

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Matei Zaharia
I believe this is needed for driver recovery in Spark Streaming. If your Spark driver program crashes, Spark Streaming can recover the application by reading the set of DStreams and output operations from a checkpoint file (see

Re: Spark UI and Spark Version on Google Compute Engine

2015-01-17 Thread Matei Zaharia
Unfortunately we don't have anything to do with Spark on GCE, so I'd suggest asking in the GCE support forum. You could also try to launch a Spark cluster by hand on nodes in there. Sigmoid Analytics published a package for this here: http://spark-packages.org/package/9 Matei On Jan 17,

Re: spark 1.2 compatibility

2015-01-16 Thread Matei Zaharia
The Apache Spark project should work with it, but I'm not sure you can get support from HDP (if you have that). Matei On Jan 16, 2015, at 5:36 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Should clarify on this. I personally have used HDP 2.1 + Spark 1.2 and have not seen a

Re: Pattern Matching / Equals on Case Classes in Spark Not Working

2015-01-12 Thread Matei Zaharia
Is this in the Spark shell? Case classes don't work correctly in the Spark shell unfortunately (though they do work in the Scala shell) because we change the way lines of code compile to allow shipping functions across the network. The best way to get case classes in there is to compile them

Fwd: ApacheCon North America 2015 Call For Papers

2015-01-05 Thread Matei Zaharia
FYI, ApacheCon North America call for papers is up. Matei Begin forwarded message: Date: January 5, 2015 at 9:40:41 AM PST From: Rich Bowen rbo...@rcbowen.com Reply-To: dev d...@community.apache.org To: dev d...@community.apache.org Subject: ApacheCon North America 2015 Call For Papers

Re: JetS3T settings spark

2014-12-30 Thread Matei Zaharia
This file needs to be on your CLASSPATH actually, not just in a directory. The best way to pass it in is probably to package it into your application JAR. You can put it in src/main/resources in a Maven or SBT project, and check that it makes it into the JAR using jar tf yourfile.jar. Matei

Re: action progress in ipython notebook?

2014-12-29 Thread Matei Zaharia
Hey Eric, sounds like you are running into several issues, but thanks for reporting them. Just to comment on a few of these: I'm not seeing RDDs or SRDDs cached in the Spark UI. That page remains empty despite my calling cache(). This is expected until you compute the RDDs the first time

Re: When will spark 1.2 released?

2014-12-18 Thread Matei Zaharia
Yup, as he posted before: "An Apache infrastructure issue prevented me from pushing this last night. The issue was resolved today and I should be able to push the final release artifacts tonight." On Dec 18, 2014, at 10:14 PM, Andrew Ash and...@andrewash.com wrote: Patrick is working on the

Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Matei Zaharia
The problem is very likely NFS, not Spark. What kind of network is it mounted over? You can also test the performance of your NFS by copying a file from it to a local disk or to /dev/null and seeing how many bytes per second it can copy. Matei On Dec 17, 2014, at 9:38 AM, Larryliu

Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Matei Zaharia
is running on the same server that Spark is running on. So basically I mount the NFS on the same bare metal machine. Larry On Wed, Dec 17, 2014 at 11:42 AM, Matei Zaharia matei.zaha...@gmail.com wrote: The problem is very likely NFS, not Spark. What kind

Re: Spark SQL Roadmap?

2014-12-13 Thread Matei Zaharia
Spark SQL is already available, the reason for the alpha component label is that we are still tweaking some of the APIs so we have not yet guaranteed API stability for it. However, that is likely to happen soon (possibly 1.3). One of the major things added in Spark 1.2 was an external data

Re: what is the best way to implement mini batches?

2014-12-11 Thread Matei Zaharia
You can just do mapPartitions on the whole RDD, and then call sliding() on the iterator in each one to get a sliding window. One problem is that you will not be able to slide forward into the next partition at partition boundaries. If this matters to you, you need to do something more
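
A sketch of the per-partition approach (`batchSize` is a placeholder); as noted above, batches and windows won't span partition boundaries:

    // Fixed-size mini batches within each partition:
    val batches = rdd.mapPartitions(_.grouped(batchSize))
    // Overlapping sliding windows within each partition:
    val windows = rdd.mapPartitions(_.sliding(batchSize))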

Re: dockerized spark executor on mesos?

2014-12-03 Thread Matei Zaharia
I'd suggest asking about this on the Mesos list (CCed). As far as I know, there was actually some ongoing work for this. Matei On Dec 3, 2014, at 9:46 AM, Dick Davies d...@hellooperator.net wrote: Just wondered if anyone had managed to start spark jobs on mesos wrapped in a docker

Re: configure to run multiple tasks on a core

2014-11-26 Thread Matei Zaharia
Instead of SPARK_WORKER_INSTANCES you can also set SPARK_WORKER_CORES, to have one worker that thinks it has more cores. Matei On Nov 26, 2014, at 5:01 PM, Yotto Koga yotto.k...@autodesk.com wrote: Thanks Sean. That worked out well. For anyone who happens onto this post and wants to do

Re: Spark SQL - Any time line to move beyond Alpha version ?

2014-11-25 Thread Matei Zaharia
The main reason for the alpha tag is actually that APIs might still be evolving, but we'd like to freeze the API as soon as possible. Hopefully it will happen in one of 1.3 or 1.4. In Spark 1.2, we're adding an external data source API that we'd like to get experience with before freezing it.

Re: Configuring custom input format

2014-11-25 Thread Matei Zaharia
How are you creating the object in your Scala shell? Maybe you can write a function that directly returns the RDD, without assigning the object to a temporary variable. Matei On Nov 5, 2014, at 2:54 PM, Corey Nolet cjno...@gmail.com wrote: The closer I look @ the stack trace in the Scala

Re: Configuring custom input format

2014-11-25 Thread Matei Zaharia
. On Tue, Nov 25, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: How are you creating the object in your Scala shell? Maybe you can write a function that directly returns the RDD, without assigning the object to a temporary variable. Matei

Re: do not assemble the spark example jar

2014-11-25 Thread Matei Zaharia
You can do sbt/sbt assembly/assembly to assemble only the main package. Matei On Nov 25, 2014, at 7:50 PM, lihu lihu...@gmail.com wrote: Hi, The spark assembly is time costly. If I only need the spark-assembly-1.1.0-hadoop2.3.0.jar, do not need the

Re: do not assemble the spark example jar

2014-11-25 Thread Matei Zaharia
BTW as another tip, it helps to keep the SBT console open as you make source changes (by just running sbt/sbt with no args). It's a lot faster the second time it builds something. Matei On Nov 25, 2014, at 8:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: You can do sbt/sbt assembly

Re: rack-topology.sh no such file or directory

2014-11-19 Thread Matei Zaharia
Your Hadoop configuration is set to look for this file to determine racks. Is the file present on cluster nodes? If not, look at your hdfs-site.xml and remove the setting for a rack topology script there (or it might be in core-site.xml). Matei On Nov 19, 2014, at 12:13 PM, Arun Luthra

Re: Kafka version dependency in Spark 1.2

2014-11-10 Thread Matei Zaharia
Just curious, what are the pros and cons of this? Can the 0.8.1.1 client still talk to 0.8.0 versions of Kafka, or do you need it to match your Kafka version exactly? Matei On Nov 10, 2014, at 9:48 AM, Bhaskar Dutta bhas...@gmail.com wrote: Hi, Is there any plan to bump the Kafka

Re: closure serialization behavior driving me crazy

2014-11-10 Thread Matei Zaharia
Hey Sandy, Try using the -Dsun.io.serialization.extendedDebugInfo=true flag on the JVM to print the contents of the objects. In addition, something else that helps is to do the following: { val _arr = arr; models.map(... _arr ...) } Basically, copy the global variable into a local one.
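
Expanded, the local-copy trick looks like this sketch (names are hypothetical); the task closure then captures only the local array instead of the whole enclosing object:

    object Trainer {
      val arr = Array(1.0, 2.0, 3.0) // a field of the (possibly unserializable) object

      def score(models: org.apache.spark.rdd.RDD[Int]) = {
        val _arr = arr // local copy: the closure now references _arr only
        models.map(i => _arr(i % _arr.length))
      }
    }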

Re: Why does this siimple spark program uses only one core?

2014-11-09 Thread Matei Zaharia
Call getNumPartitions() on your RDD to make sure it has the right number of partitions. You can also specify it when doing parallelize, e.g. rdd = sc.parallelize(xrange(1000), 10). This should run in parallel if you have multiple partitions and cores, but it might be that during part of the

Re: weird caching

2014-11-08 Thread Matei Zaharia
It might mean that some partition was computed on two nodes, because a task for it wasn't able to be scheduled locally on the first node. Did the RDD really have 426 partitions total? You can click on it and see where there are copies of each one. Matei On Nov 8, 2014, at 10:16 PM, Nathan

Re: Any Replicated RDD in Spark?

2014-11-05 Thread Matei Zaharia
for me to do that? Collect RDD in driver first and create broadcast? Or any shortcut in spark for this? Thanks! -Original Message- From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Wednesday, November 05, 2014 3:32 PM To: 'Matei Zaharia' Cc: 'user@spark.apache.org' Subject

Re: Breaking the previous large-scale sort record with Spark

2014-11-05 Thread Matei Zaharia
this happen. Updated blog post: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi folks, I interrupt your regularly scheduled user / dev list to bring you

Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
Is this about Spark SQL vs Redshift, or Spark in general? Spark in general provides a broader set of capabilities than Redshift because it has APIs in general-purpose languages (Java, Scala, Python) and libraries for things like machine learning and graph processing. For example, you might use

Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
exported from Redshift into Spark or Hadoop. Matei On Nov 4, 2014, at 3:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Is this about Spark SQL vs Redshift, or Spark in general? Spark in general provides a broader set of capabilities than Redshift because it has APIs in general-purpose

Re: Any Replicated RDD in Spark?

2014-11-03 Thread Matei Zaharia
You need to use broadcast followed by flatMap or mapPartitions to do map-side joins (in your map function, you can look at the hash table you broadcast and see what records match it). Spark SQL also does it by default for tables smaller than the spark.sql.autoBroadcastJoinThreshold setting (by
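
A sketch of the broadcast-then-map pattern with hypothetical RDDs (`small` fits in memory; `large` is an RDD of pairs):

    // Collect the small side, broadcast it, and probe it per record.
    val smallMap: Map[Int, String] = small.collect().toMap
    val smallBc = sc.broadcast(smallMap)
    val joined = large.flatMap { case (k, v) =>
      smallBc.value.get(k).map(s => (k, (v, s))) // emit only matching keys
    }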

Re: union of SchemaRDDs

2014-11-01 Thread Matei Zaharia
Try unionAll, which is a special method on SchemaRDDs that keeps the schema on the results. Matei On Nov 1, 2014, at 3:57 PM, Daniel Mahler dmah...@gmail.com wrote: I would like to combine 2 parquet tables I have create. I tried: sc.union(sqx.parquetFile(fileA),
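
Applied to the parquet tables from the question, a sketch using the 1.x-era API:

    val a = sqlContext.parquetFile("fileA")
    val b = sqlContext.parquetFile("fileB")
    // unionAll preserves the schema; sc.union would return a plain RDD[Row].
    val combined = a.unionAll(b)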

Re: union of SchemaRDDs

2014-11-01 Thread Matei Zaharia
Matei. What does unionAll do if the input RDD schemas are not 100% compatible? Does it take the union of the columns and generalize the types? thanks Daniel On Sat, Nov 1, 2014 at 6:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Try unionAll, which

Re: SparkContext.stop() ?

2014-10-31 Thread Matei Zaharia
You don't have to call it if you just exit your application, but it's useful for example in unit tests if you want to create and shut down a separate SparkContext for each test. Matei On Oct 31, 2014, at 10:39 AM, Evan R. Sparks evan.spa...@gmail.com wrote: In cluster settings if you don't
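
A sketch of that per-test pattern:

    // Create a fresh local context for the test and tear it down afterwards.
    val sc = new org.apache.spark.SparkContext("local[2]", "unit-test")
    try {
      assert(sc.parallelize(1 to 10).sum() == 55.0)
    } finally {
      sc.stop()
    }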

Re: Confused about class paths in spark 1.1.0

2014-10-30 Thread Matei Zaharia
Try using --jars instead of the driver-only options; they should work with spark-shell too but they may be less tested. Unfortunately, you do have to specify each JAR separately; you can maybe use a shell script to list a directory and get a big list, or set up a project that builds all of the

Re: Confused about class paths in spark 1.1.0

2014-10-30 Thread Matei Zaharia
to spark-shell. Correct? If so I will file a bug report since this is definitely not the case. On Thu, Oct 30, 2014 at 5:39 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Try using --jars instead of the driver-only options; they should work with spark-shell

Re: BUG: when running as extends App, closures don't capture variables

2014-10-29 Thread Matei Zaharia
Good catch! If you'd like, you can send a pull request changing the files in docs/ to do this (see https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark), otherwise maybe open an issue on

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
A pretty large fraction of users use Java, but a few features are still not available in it. JdbcRDD is one of them -- this functionality will likely be superseded by Spark SQL when we add JDBC as a data source. In the meantime, to use it, I'd recommend writing a class in Scala that has

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
The overridable methods of RDD are marked as @DeveloperApi, which means that these are internal APIs used by people that might want to extend Spark, but are not guaranteed to remain stable across Spark versions (unlike Spark's public APIs). BTW, if you want a way to do this that does not

Re: Primitive arrays in Spark

2014-10-21 Thread Matei Zaharia
It seems that ++ does the right thing on arrays of longs, and gives you another one:

scala> val a = Array[Long](1,2,3)
a: Array[Long] = Array(1, 2, 3)

scala> val b = Array[Long](1,2,3)
b: Array[Long] = Array(1, 2, 3)

scala> a ++ b
res0: Array[Long] = Array(1, 2, 3, 1, 2, 3)

scala> res0.getClass

Re: Submissions open for Spark Summit East 2015

2014-10-19 Thread Matei Zaharia
BTW several people asked about registration and student passes. Registration will open in a few weeks, and like in previous Spark Summits, I expect there to be a special pass for students. Matei On Oct 18, 2014, at 9:52 PM, Matei Zaharia matei.zaha...@gmail.com wrote: After successful

Re: mllib.linalg.Vectors vs Breeze?

2014-10-18 Thread Matei Zaharia
toBreeze is private within Spark, so it should not be accessible to users. If you want to make a Breeze vector from an MLlib one, it's pretty straightforward, and you can make your own utility function for it. Matei On Oct 17, 2014, at 5:09 PM, Sean Owen so...@cloudera.com wrote: Yes, I
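
A sketch of such a utility for dense and sparse vectors, assuming Breeze is on the classpath; the conversion mirrors what the private helper does:

    import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

    def toBreeze(v: Vector): BV[Double] = v match {
      case d: DenseVector  => new BDV(d.values)
      case s: SparseVector => new BSV(s.indices, s.values, s.size)
    }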

Submissions open for Spark Summit East 2015

2014-10-18 Thread Matei Zaharia
After successful events in the past two years, the Spark Summit conference has expanded for 2015, offering both an event in New York on March 18-19 and one in San Francisco on June 15-17. The conference is a great chance to meet people from throughout the Spark community and see the latest

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Matei Zaharia
of issues. Thanks in advance! On Oct 10, 2014 10:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which is that we've been able to use Spark to break MapReduce's 100 TB

Re: Blog post: An Absolutely Unofficial Way to Connect Tableau to SparkSQL (Spark 1.1)

2014-10-11 Thread Matei Zaharia
Very cool Denny, thanks for sharing this! Matei On Oct 11, 2014, at 9:46 AM, Denny Lee denny.g@gmail.com wrote: https://www.concur.com/blog/en-us/connect-tableau-to-sparksql If you're wondering how to connect Tableau to SparkSQL - here are the steps to connect Tableau to SparkSQL.

Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Matei Zaharia
Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which is that we've been able to use Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x fewer nodes. There's a detailed writeup at

Re: add Boulder-Denver Spark meetup to list on website

2014-10-10 Thread Matei Zaharia
Added you, thanks! (You may have to shift-refresh the page to see it updated). Matei On Oct 10, 2014, at 1:52 PM, Michael Oczkowski michael.oczkow...@seeq.com wrote: Please add the Boulder-Denver Spark meetup group to the list on the website.

Re: Convert a org.apache.spark.sql.SchemaRDD[Row] to a RDD of Strings

2014-10-09 Thread Matei Zaharia
A SchemaRDD is still an RDD, so you can just do rdd.map(row => row.toString). Or if you want to get a particular field of the row, you can do rdd.map(row => row(3).toString). Matei On Oct 9, 2014, at 1:22 PM, Soumya Simanta soumya.sima...@gmail.com wrote: I've a SchemaRDD that I want to

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Matei Zaharia
I'm pretty sure inner joins on Spark SQL already build only one of the sides. Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer joins do both, and it seems like we could optimize it for those that are not full. Matei On Oct 7, 2014, at 11:04 PM, Haopu Wang

Re: Spark SQL -- more than two tables for join

2014-10-07 Thread Matei Zaharia
The issue is that you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how you'd write this in HiveQL, and then try doing that with HiveContext.

Re: run scalding on spark

2014-10-01 Thread Matei Zaharia
Pretty cool, thanks for sharing this! I've added a link to it on the wiki: https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects. Matei On Oct 1, 2014, at 1:41 PM, Koert Kuipers ko...@tresata.com wrote: well, sort of! we make input/output formats (cascading taps,

Re: Spark And Mapr

2014-10-01 Thread Matei Zaharia
It should just work in PySpark, the same way it does in Java / Scala apps. Matei On Oct 1, 2014, at 4:12 PM, Sungwook Yoon sy...@maprtech.com wrote: Yes.. you should use maprfs:// I personally haven't used pyspark, I just used scala shell or standalone with MapR. I think you need to

Re: Multiple spark shell sessions

2014-10-01 Thread Matei Zaharia
You need to set --total-executor-cores to limit how many total cores it grabs on the cluster. --executor-cores is just for each individual executor, but it will try to launch many of them. Matei On Oct 1, 2014, at 4:29 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID wrote: hey

Re: Spark Code to read RCFiles

2014-09-23 Thread Matei Zaharia
Is your file managed by Hive (and thus present in a Hive metastore)? In that case, Spark SQL (https://spark.apache.org/docs/latest/sql-programming-guide.html) is the easiest way. Matei On September 23, 2014 at 2:26:10 PM, Pramod Biligiri (pramodbilig...@gmail.com) wrote: Hi, I'm trying to

Re: Possibly a dumb question: differences between saveAsNewAPIHadoopFile and saveAsNewAPIHadoopDataset?

2014-09-22 Thread Matei Zaharia
File takes a filename to write to, while Dataset takes only a JobConf. This means that Dataset is more general (it can also save to storage systems that are not file systems, such as key-value stores), but is more annoying to use if you actually have a file. Matei On September 21, 2014 at
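
A sketch of both calls on a hypothetical `pairs: RDD[(Text, IntWritable)]`; paths and classes are illustrative:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}

    // File variant: hand it a path and the classes.
    pairs.saveAsNewAPIHadoopFile("hdfs:///out", classOf[Text], classOf[IntWritable],
      classOf[TextOutputFormat[Text, IntWritable]])

    // Dataset variant: everything, including the destination, lives in the
    // job configuration, so it also fits non-file storage systems.
    val job = Job.getInstance(sc.hadoopConfiguration)
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    job.setOutputFormatClass(classOf[TextOutputFormat[Text, IntWritable]])
    FileOutputFormat.setOutputPath(job, new Path("hdfs:///out2"))
    pairs.saveAsNewAPIHadoopDataset(job.getConfiguration)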

Re: paging through an RDD that's too large to collect() all at once

2014-09-18 Thread Matei Zaharia
Hey Dave, try out RDD.toLocalIterator -- it gives you an iterator that reads one RDD partition at a time. Scala iterators also have methods like grouped() that let you get fixed-size groups. Matei On September 18, 2014 at 7:58:34 PM, dave-anderson (david.ander...@pobox.com) wrote: I have an
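
A sketch of the paging loop; the page size and the `handlePage` helper are placeholders:

    // Pulls one partition at a time to the driver, then pages in groups of 1000.
    rdd.toLocalIterator.grouped(1000).foreach { page =>
      handlePage(page) // hypothetical per-page handler
    }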

Re: Short Circuit Local Reads

2014-09-17 Thread Matei Zaharia
I'm pretty sure it does help, though I don't have any numbers for it. In any case, Spark will automatically benefit from this if you link it to a version of HDFS that contains this. Matei On September 17, 2014 at 5:15:47 AM, Gary Malouf (malouf.g...@gmail.com) wrote: Cloudera had a blog post

Re: Spark as a Library

2014-09-16 Thread Matei Zaharia
If you want to run the computation on just one machine (using Spark's local mode), it can probably run in a container. Otherwise you can create a SparkContext there and connect it to a cluster outside. Note that I haven't tried this though, so the security policies of the container might be too

Re: scala 2.11?

2014-09-15 Thread Matei Zaharia
Scala 2.11 work is under way in open pull requests though, so hopefully it will be in soon. Matei On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) wrote: ah...thanks! On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra m...@clearstorydata.com wrote: No, not yet.  Spark SQL is

Re: scala 2.11?

2014-09-15 Thread Matei Zaharia
at the earliest. On Mon, Sep 15, 2014 at 12:11 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Scala 2.11 work is under way in open pull requests though, so hopefully it will be in soon. Matei On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) wrote: ah...thanks! On Mon, Sep 15
