I’m pretty sure that Catalyst was built before Calcite, or at least in
parallel. Calcite 1.0 was only released in 2015. From a technical standpoint,
building Catalyst in Scala also made it more concise and easier to extend than
an optimizer written in Java (you can find various presentations
Hi Bartosz,
This is because the vote on 2.4 has passed (you can see the vote thread on the
dev mailing list) and we are just working to get the release into various
channels (Maven, PyPI, etc), which can take some time. Expect to see an
announcement soon once that’s done.
Matei
> On Nov 4,
> s prefer to get that notification sooner
> rather than later?
>
> On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia
> wrote:
> I’d like to understand the maintenance burden of Python 2 before deprecating
> it. Since it is not EOL yet, it might make sense to only deprecate it once
> it’s
I’d like to understand the maintenance burden of Python 2 before deprecating
it. Since it is not EOL yet, it might make sense to only deprecate it once it’s
EOL (which is still over a year from now). Supporting Python 2+3 seems less
burdensome than supporting, say, multiple Scala versions in
GraphFrames (https://graphframes.github.io) offers a Cypher-like syntax that
then executes on Spark SQL.
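For reference, here is a rough sketch of its motif-finding API, which is the Cypher-like part (the vertex and edge DataFrames, with the usual id/src/dst columns, are assumed to already exist):

import org.graphframes.GraphFrame

// vertices: DataFrame with an "id" column; edges: DataFrame with "src" and "dst" columns (assumed)
val g = GraphFrame(vertices, edges)

// Cypher-like motif: pairs (a, c) connected through an intermediate vertex b
val friendsOfFriends = g.find("(a)-[e1]->(b); (b)-[e2]->(c)").filter("a.id != c.id")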
> On Sep 14, 2018, at 2:42 AM, kant kodali wrote:
>
> Hi All,
>
> Is there any open source framework that converts Cypher to SparkSQL?
>
> Thanks!
Maybe your application is overriding the master variable when it creates its
SparkContext. I see you are still passing “yarn-client” to it as an argument
later in your command.
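For illustration, here's a minimal sketch of the pitfall (the app name is made up): a master hard-coded on the SparkConf wins over whatever you pass on the command line, so drop setMaster if you want the launch command to control it.

import org.apache.spark.{SparkConf, SparkContext}

// This overrides the master passed on the command line:
val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
val sc = new SparkContext(conf)

// To let spark-submit / the launch command decide, omit setMaster:
val conf2 = new SparkConf().setAppName("MyApp")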
> On Jun 17, 2018, at 11:53 AM, Raymond Xie wrote:
>
> Thank you Subhash.
>
> Here is the new command:
>
Hi Ismael,
It depends on what you mean by “support”. In general, there won’t be new
feature releases for 1.X (e.g. Spark 1.7) because all the new features are
being added to the master branch. However, there is always room for bug fix
releases if there is a catastrophic bug, and committers can
The batches should all have the same application ID, so use that one. You can
also find the application in the YARN UI to terminate it from there.
Matei
> On Aug 27, 2017, at 10:27 AM, KhajaAsmath Mohammed
> wrote:
>
> Hi,
>
> I am new to spark streaming and not
You can also find a lot of GitHub repos for external packages here:
http://spark.apache.org/third-party-projects.html
Matei
> On Jul 25, 2017, at 5:30 PM, Frank Austin Nothaft
> wrote:
>
> There’s a number of real-world open source Spark applications in the sciences:
>
The Kafka source will only appear in 2.0.2 -- see this thread for the current
release candidate:
https://lists.apache.org/thread.html/597d630135e9eb3ede54bb0cc0b61a2b57b189588f269a64b58c9243@%3Cdev.spark.apache.org%3E
. You can try that right now if you want from the staging Maven repo shown
I think people explained this pretty well, but in practice, this distinction is
also somewhat of a marketing term, because every system will perform some kind
of batching. For example, every time you use TCP, the OS and network stack may
buffer multiple messages together and send them at once;
To unsubscribe, please send an email to user-unsubscr...@spark.apache.org from
the address you're subscribed from.
Matei
> On Aug 10, 2016, at 12:48 PM, Sohil Jain wrote:
>
Yes, a built-in mechanism is planned in future releases. You can also drop it
using a filter for now but the stateful operators will still keep state for old
windows.
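As a rough sketch of the filter approach, assuming a streaming DataFrame events with "timestamp" and "word" columns (the names are examples); the aggregation itself still retains the old windows' state internally:

import org.apache.spark.sql.functions._

val counts = events
  .groupBy(window(col("timestamp"), "10 minutes"), col("word"))
  .count()

// Keep only windows that ended within the last hour before writing results out.
val recent = counts.filter(
  unix_timestamp(col("window").getField("end")) >= unix_timestamp(current_timestamp()) - 3600
)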
Matei
> On Aug 6, 2016, at 9:40 AM, Amit Sela wrote:
>
> I've noticed that when using Structured
Yup, they will definitely coexist. Structured Streaming is currently alpha and
will probably be complete in the next few releases, but Spark Streaming will
continue to exist, because it gives the user more low-level control. It's
similar to DataFrames vs RDDs (RDDs are the lower-level API for
Hi all, FYI, we've recently updated the Spark logo at https://spark.apache.org/
to say "Apache Spark" instead of just "Spark". Many ASF projects have been
doing this recently to make it clearer that they are associated with the ASF,
and indeed the ASF's branding guidelines generally require
I don't think any of the developers use this as an official channel, but all
the ASF IRC channels are indeed on FreeNode. If there's demand for it, we can
document this on the website and say that it's mostly for users to find other
users. Development discussions should happen on the dev
This sounds good to me as well. The one thing we should pay attention to is how
we update the docs so that people know to start with the spark.ml classes.
Right now the docs list spark.mllib first and also seem more comprehensive in
that area than in spark.ml, so maybe people naturally move
> able to dispatch jobs from both actions simultaneously (or on a
> when-workers-become-available basis)?
>
> On 15 January 2016 at 11:44, Koert Kuipers <ko...@tresata.com> wrote:
> we run multiple actions on the same (cached) rdd a
RDDs actually are thread-safe, and quite a few applications use them this way,
e.g. the JDBC server.
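For example, here's a minimal sketch of submitting two actions on the same cached RDD from separate threads (the path is hypothetical, and sc is the shell's SparkContext):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val data = sc.textFile("hdfs:///logs").cache()

// Both actions can be scheduled concurrently; the cached RDD is shared safely.
val totalF  = Future { data.count() }
val errorsF = Future { data.filter(_.contains("ERROR")).count() }

val total  = Await.result(totalF, 10.minutes)
val errors = Await.result(errorsF, 10.minutes)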
Matei
> On Jan 15, 2016, at 2:10 PM, Jakob Odersky wrote:
>
> I don't think RDDs are threadsafe.
> More fundamentally however, why would you want to run RDD actions in
>
Have you tried just downloading a pre-built package, or linking to Spark
through Maven? You don't need to build it unless you are changing code inside
it. Check out
http://spark.apache.org/docs/latest/quick-start.html#self-contained-applications
for how to link to it.
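For instance, a minimal build.sbt sketch for linking against a pre-built Spark (the versions shown are just examples; the quick-start guide has the equivalent Maven coordinates):

// build.sbt
name := "simple-app"
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"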
Matei
> On Jan 15,
In production, I'd recommend using IAM roles to avoid having keys altogether.
Take a look at
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html.
Matei
> On Jan 11, 2016, at 11:32 AM, Sabarish Sasidharan
> wrote:
>
> If you are
You can publish your version of Hadoop to your local Maven cache with mvn install
(just give it a different version number, e.g. 2.7.0a) and then pass that as
the Hadoop version to Spark's build (see
http://spark.apache.org/docs/latest/building-spark.html
If you run on YARN, you can use Kerberos, be authenticated as the right user,
etc in the same way as MapReduce jobs.
Matei
> On Sep 3, 2015, at 1:37 PM, Daniel Schulz
> wrote:
>
> Hi,
>
> I really enjoy using Spark. An obstacle to sell it to our clients
> entitled to read/write? Will
> it enforce HDFS ACLs and Ranger policies as well?
>
> Best regards, Daniel.
>
> > On 03 Sep 2015, at 21:16, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >
> > If you ru
This means that one of your cached RDD partitions is bigger than 2 GB of data.
You can fix it by having more partitions. If you read data from a file system
like HDFS or S3, set the number of partitions higher in the sc.textFile,
hadoopFile, etc methods (it's an optional second parameter to
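As a rough sketch of both options (paths and counts are just examples):

// minPartitions is the optional second argument to sc.textFile / hadoopFile
val data = sc.textFile("hdfs:///big/dataset", 2000)

// or repartition an RDD you already have so each partition stays well under 2 GB
val finer = existingRdd.repartition(2000)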
This documentation is only for writes to an external system, but all the
counting you do within your streaming app (e.g. if you use reduceByKeyAndWindow
to keep track of a running count) is exactly-once. When you write to a storage
system, no matter which streaming framework you use, you'll
[4,5,6] can be invoked before the operation for offset [1,2,3]
2) If you wanted to achieve something similar to what TridentState does,
you'll have to do it yourself (for example using Zookeeper)
Is this a correct understanding?
On Wed, Jun 17, 2015 at 7:14 PM, Matei Zaharia matei.zaha
This happens automatically when you use the byKey operations, e.g. reduceByKey,
updateStateByKey, etc. Spark Streaming keeps the state for a given set of keys
on a specific node and sends new tuples with that key to that node.
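As a small sketch (the source, interval and checkpoint path are examples), a running word count with updateStateByKey; the state for each key lives on one node and new tuples for that key are routed there:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("/tmp/streaming-checkpoints")   // stateful operations require a checkpoint dir

val pairs = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))
val runningCounts = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))
}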
Matei
On Jun 3, 2015, at 6:31 AM, allonsy luke1...@gmail.com wrote:
-XX:-UseCompressedOops
SPARK_DRIVER_MEMORY=129G
spark version: 1.1.1
Thank you a lot for your help!
2015-06-02 4:40 GMT+02:00 Matei Zaharia <matei.zaha...@gmail.com>:
As long as you don't use cache(), these operations will go from disk to disk,
and will only use
?
Thank you!
2015-06-02 21:25 GMT+02:00 Matei Zaharia <matei.zaha...@gmail.com>:
You shouldn't have to persist the RDD at all, just call flatMap and reduce on
it directly. If you try to persist it, that will try to load the original data
into memory, but here
Check out Apache's trademark guidelines here:
http://www.apache.org/foundation/marks/
Matei
On May 20, 2015, at 12:02 AM, Justin Pihony justin.pih...@gmail.com wrote:
What is the license on using the Spark logo? Is it free to be used for
displaying
Hey Tom,
Are you using the fine-grained or coarse-grained scheduler? For the
coarse-grained scheduler, there is a spark.cores.max config setting that will
limit the total # of cores it grabs. This was there in earlier versions too.
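For example (the number is arbitrary), with the coarse-grained scheduler you can cap the total cores the application grabs across the cluster:

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.cores.max", "48")
// or equivalently: --conf spark.cores.max=48 on spark-submit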
Matei
On May 19, 2015, at 12:39 PM, Thomas Dudziak
of tasks per job :)
cheers,
Tom
On Tue, May 19, 2015 at 10:05 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
Hey Tom,
Are you using the fine-grained or coarse-grained scheduler? For the
coarse-grained scheduler, there is a spark.cores.max config
...This is madness!
On May 14, 2015, at 9:31 AM, dmoralesdf dmora...@stratio.com wrote:
Hi there,
We have released our real-time aggregation engine based on Spark Streaming.
SPARKTA is fully open source (Apache2)
You can checkout the slides showed up at the Strata past week:
(Sorry, for non-English people: that means it's a good thing.)
Matei
On May 14, 2015, at 10:53 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
...This is madness!
On May 14, 2015, at 9:31 AM, dmoralesdf dmora...@stratio.com wrote:
Hi there,
We have released our real-time
It could also be that your hash function is expensive. What is the key class
you have for the reduceByKey / groupByKey?
Matei
On May 12, 2015, at 10:08 AM, Night Wolf nightwolf...@gmail.com wrote:
I'm seeing a similar thing with a slightly different stack trace. Ideas?
You could build Spark with Scala 2.11 on Mac / Linux and transfer it over to
Windows. AFAIK it should build on Windows too, the only problem is that Maven
might take a long time to download dependencies. What errors are you seeing?
Matei
On Apr 16, 2015, at 9:23 AM, Arun Lists
Very neat, Olivier; thanks for sharing this.
Matei
On Apr 15, 2015, at 5:58 PM, Olivier Chapelle oliv...@chapelle.cc wrote:
Dear Spark users,
I would like to draw your attention to a dataset that we recently released,
which is as of now the largest machine learning dataset ever released;
Feel free to send a pull request to fix the doc (or say which versions it's
needed in).
Matei
On Mar 20, 2015, at 6:49 PM, Krishna Sankar ksanka...@gmail.com wrote:
Yep the command-option is gone. No big deal, just add the '%pylab inline'
command as part of your notebook.
Cheers
k/
The programming guide has a short example:
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets.
Note that once you infer a schema for a JSON dataset, you can also use nested
path notation
Thanks! I've added you.
Matei
On Feb 17, 2015, at 4:06 PM, Ralph Bergmann | the4thFloor.eu
ra...@the4thfloor.eu wrote:
Hi,
there is a small Spark Meetup group in Berlin, Germany :-)
http://www.meetup.com/Berlin-Apache-Spark-Meetup/
Please add this group to the Meetups list at
You don't need HDFS or virtual machines to run Spark. You can just download it,
unzip it and run it on your laptop. See
http://spark.apache.org/docs/latest/index.html.
Matei
On Feb 6, 2015, at 2:58 PM, David Fallside falls...@us.ibm.com wrote:
I believe this is needed for driver recovery in Spark Streaming. If your Spark
driver program crashes, Spark Streaming can recover the application by reading
the set of DStreams and output operations from a checkpoint file (see
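As a rough sketch of what that looks like in the driver (the paths, app name and batch interval are made up): the driver uses StreamingContext.getOrCreate, so on a restart it rebuilds the DStream graph and output operations from the checkpoint directory instead of calling the factory function:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("recoverable-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/recoverable-app")
  // define DStreams and output operations here
  ssc
}

val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/recoverable-app", createContext _)
ssc.start()
ssc.awaitTermination()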
Unfortunately we don't have anything to do with Spark on GCE, so I'd suggest
asking in the GCE support forum. You could also try to launch a Spark cluster
by hand on nodes in there. Sigmoid Analytics published a package for this here:
http://spark-packages.org/package/9
Matei
On Jan 17,
The Apache Spark project should work with it, but I'm not sure you can get
support from HDP (if you have that).
Matei
On Jan 16, 2015, at 5:36 PM, Judy Nash judyn...@exchange.microsoft.com
wrote:
Should clarify on this. I personally have used HDP 2.1 + Spark 1.2 and have
not seen a
Is this in the Spark shell? Case classes don't work correctly in the Spark
shell unfortunately (though they do work in the Scala shell) because we change
the way lines of code compile to allow shipping functions across the network.
The best way to get case classes in there is to compile them
FYI, ApacheCon North America call for papers is up.
Matei
Begin forwarded message:
Date: January 5, 2015 at 9:40:41 AM PST
From: Rich Bowen rbo...@rcbowen.com
Reply-To: dev d...@community.apache.org
To: dev d...@community.apache.org
Subject: ApacheCon North America 2015 Call For Papers
This file needs to be on your CLASSPATH actually, not just in a directory. The
best way to pass it in is probably to package it into your application JAR. You
can put it in src/main/resources in a Maven or SBT project, and check that it
makes it into the JAR using jar tf yourfile.jar.
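As a quick runtime check that the file really made it onto the classpath (the file name here is just a placeholder):

// Returns null if the resource is not on the classpath (i.e. not packaged into the JAR).
val in = getClass.getResourceAsStream("/my-config.xml")
require(in != null, "my-config.xml is not on the classpath")
val contents = scala.io.Source.fromInputStream(in).mkString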
Matei
Hey Eric, sounds like you are running into several issues, but thanks for
reporting them. Just to comment on a few of these:
I'm not seeing RDDs or SRDDs cached in the Spark UI. That page remains empty
despite my calling cache().
This is expected until you compute the RDDs the first time
Yup, as he posted before: "An Apache infrastructure issue prevented me from
pushing this last night. The issue was resolved today and I should be able to
push the final release artifacts tonight."
On Dec 18, 2014, at 10:14 PM, Andrew Ash and...@andrewash.com wrote:
Patrick is working on the
The problem is very likely NFS, not Spark. What kind of network is it mounted
over? You can also test the performance of your NFS by copying a file from it
to a local disk or to /dev/null and seeing how many bytes per second it can
copy.
Matei
On Dec 17, 2014, at 9:38 AM, Larryliu
is running on the same server that Spark
is running on. So basically I mount the NFS on the same bare metal machine.
Larry
On Wed, Dec 17, 2014 at 11:42 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
The problem is very likely NFS, not Spark. What kind
Spark SQL is already available, the reason for the alpha component label is
that we are still tweaking some of the APIs so we have not yet guaranteed API
stability for it. However, that is likely to happen soon (possibly 1.3). One of
the major things added in Spark 1.2 was an external data
You can just do mapPartitions on the whole RDD, and then call sliding() on
the iterator in each one to get a sliding window. One problem is that you will
not be able to slide forward into the next partition at partition boundaries.
If this matters to you, you need to do something more
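Here's a minimal sketch of the basic per-partition approach (the data and window size are examples); as noted above, windows built this way never cross partition boundaries:

val values = sc.parallelize(1 to 100, 4)
// sliding() on each partition's iterator yields windows of 3 consecutive elements
val windows = values.mapPartitions(iter => iter.sliding(3).map(_.toVector))
windows.take(3).foreach(println)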
I'd suggest asking about this on the Mesos list (CCed). As far as I know, there
was actually some ongoing work for this.
Matei
On Dec 3, 2014, at 9:46 AM, Dick Davies d...@hellooperator.net wrote:
Just wondered if anyone had managed to start spark
jobs on mesos wrapped in a docker
Instead of SPARK_WORKER_INSTANCES you can also set SPARK_WORKER_CORES, to have
one worker that thinks it has more cores.
Matei
On Nov 26, 2014, at 5:01 PM, Yotto Koga yotto.k...@autodesk.com wrote:
Thanks Sean. That worked out well.
For anyone who happens onto this post and wants to do
The main reason for the alpha tag is actually that APIs might still be
evolving, but we'd like to freeze the API as soon as possible. Hopefully it
will happen in one of 1.3 or 1.4. In Spark 1.2, we're adding an external data
source API that we'd like to get experience with before freezing it.
How are you creating the object in your Scala shell? Maybe you can write a
function that directly returns the RDD, without assigning the object to a
temporary variable.
Matei
On Nov 5, 2014, at 2:54 PM, Corey Nolet cjno...@gmail.com wrote:
The closer I look @ the stack trace in the Scala
On Tue, Nov 25, 2014 at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
How are you creating the object in your Scala shell? Maybe you can write a
function that directly returns the RDD, without assigning the object to a
temporary variable.
Matei
You can do sbt/sbt assembly/assembly to assemble only the main package.
Matei
On Nov 25, 2014, at 7:50 PM, lihu lihu...@gmail.com wrote:
Hi,
The spark assembly is time costly. If I only need the
spark-assembly-1.1.0-hadoop2.3.0.jar, do not need the
BTW as another tip, it helps to keep the SBT console open as you make source
changes (by just running sbt/sbt with no args). It's a lot faster the second
time it builds something.
Matei
On Nov 25, 2014, at 8:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
You can do sbt/sbt assembly
Your Hadoop configuration is set to look for this file to determine racks. Is
the file present on cluster nodes? If not, look at your hdfs-site.xml and
remove the setting for a rack topology script there (or it might be in
core-site.xml).
Matei
On Nov 19, 2014, at 12:13 PM, Arun Luthra
Just curious, what are the pros and cons of this? Can the 0.8.1.1 client still
talk to 0.8.0 versions of Kafka, or do you need it to match your Kafka version
exactly?
Matei
On Nov 10, 2014, at 9:48 AM, Bhaskar Dutta bhas...@gmail.com wrote:
Hi,
Is there any plan to bump the Kafka
Hey Sandy,
Try using the -Dsun.io.serialization.extendedDebugInfo=true flag on the JVM to
print the contents of the objects. In addition, something else that helps is to
do the following:
{
val _arr = arr
models.map(... _arr ...)
}
Basically, copy the global variable into a local one.
Call getNumPartitions() on your RDD to make sure it has the right number of
partitions. You can also specify it when doing parallelize, e.g.
rdd = sc.parallelize(xrange(1000), 10)
This should run in parallel if you have multiple partitions and cores, but it
might be that during part of the
It might mean that some partition was computed on two nodes, because a task for
it wasn't able to be scheduled locally on the first node. Did the RDD really
have 426 partitions total? You can click on it and see where there are copies
of each one.
Matei
On Nov 8, 2014, at 10:16 PM, Nathan
for me to do that? Collect RDD in driver first and create broadcast? Or
any shortcut in spark for this?
Thanks!
-Original Message-
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Wednesday, November 05, 2014 3:32 PM
To: 'Matei Zaharia'
Cc: 'user@spark.apache.org'
Subject
this
happen.
Updated blog post:
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hi folks,
I interrupt your regularly scheduled user / dev list to bring you
Is this about Spark SQL vs Redshift, or Spark in general? Spark in general
provides a broader set of capabilities than Redshift because it has APIs in
general-purpose languages (Java, Scala, Python) and libraries for things like
machine learning and graph processing. For example, you might use
exported from Redshift into Spark or Hadoop.
Matei
On Nov 4, 2014, at 3:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Is this about Spark SQL vs Redshift, or Spark in general? Spark in general
provides a broader set of capabilities than Redshift because it has APIs in
general-purpose
You need to use broadcast followed by flatMap or mapPartitions to do map-side
joins (in your map function, you can look at the hash table you broadcast and
see what records match it). Spark SQL also does it by default for tables
smaller than the spark.sql.autoBroadcastJoinThreshold setting (by
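A rough sketch of that broadcast-plus-flatMap pattern with toy data (assuming sc is your SparkContext): broadcast the small side as a map, then do the lookup inside flatMap so the large side is never shuffled:

import org.apache.spark.SparkContext._   // pair RDD functions on older Spark versions

val smallRdd = sc.parallelize(Seq((1, "a"), (2, "b")))
val largeRdd = sc.parallelize(Seq((1, 10.0), (2, 20.0), (3, 30.0)))

val smallMap = sc.broadcast(smallRdd.collectAsMap())
val joined = largeRdd.flatMap { case (k, v) =>
  smallMap.value.get(k).map(w => (k, (v, w)))   // keep only keys present in the small table
}
// joined.collect() => Array((1,(10.0,a)), (2,(20.0,b)))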
Try unionAll, which is a special method on SchemaRDDs that keeps the schema on
the results.
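For example (file names are placeholders, sqlContext is your SQLContext):

val a = sqlContext.parquetFile("fileA.parquet")
val b = sqlContext.parquetFile("fileB.parquet")
val combined = a.unionAll(b)   // keeps the schema, unlike sc.union on plain RDDs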
Matei
On Nov 1, 2014, at 3:57 PM, Daniel Mahler dmah...@gmail.com wrote:
I would like to combine 2 parquet tables I have create.
I tried:
sc.union(sqx.parquetFile(fileA),
Matei. What does unionAll do if the input RDD schemas are not 100%
compatible? Does it take the union of the columns and generalize the types?
thanks
Daniel
On Sat, Nov 1, 2014 at 6:08 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
Try unionAll, which
You don't have to call it if you just exit your application, but it's useful
for example in unit tests if you want to create and shut down a separate
SparkContext for each test.
Matei
On Oct 31, 2014, at 10:39 AM, Evan R. Sparks evan.spa...@gmail.com wrote:
In cluster settings if you don't
Try using --jars instead of the driver-only options; they should work with
spark-shell too but they may be less tested.
Unfortunately, you do have to specify each JAR separately; you can maybe use a
shell script to list a directory and get a big list, or set up a project that
builds all of the
to spark-shell. Correct? If so I will file a bug
report since this is definitely not the case.
On Thu, Oct 30, 2014 at 5:39 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
Try using --jars instead of the driver-only options; they should work with
spark-shell
Good catch! If you'd like, you can send a pull request changing the files in
docs/ to do this (see
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark),
otherwise maybe open an issue on
A pretty large fraction of users use Java, but a few features are still not
available in it. JdbcRDD is one of them -- this functionality will likely be
superseded by Spark SQL when we add JDBC as a data source. In the meantime, to
use it, I'd recommend writing a class in Scala that has
The overridable methods of RDD are marked as @DeveloperApi, which means that
these are internal APIs used by people that might want to extend Spark, but are
not guaranteed to remain stable across Spark versions (unlike Spark's public
APIs).
BTW, if you want a way to do this that does not
It seems that ++ does the right thing on arrays of longs, and gives you another
one:
scala> val a = Array[Long](1,2,3)
a: Array[Long] = Array(1, 2, 3)
scala> val b = Array[Long](1,2,3)
b: Array[Long] = Array(1, 2, 3)
scala> a ++ b
res0: Array[Long] = Array(1, 2, 3, 1, 2, 3)
scala> res0.getClass
BTW several people asked about registration and student passes. Registration
will open in a few weeks, and like in previous Spark Summits, I expect there to
be a special pass for students.
Matei
On Oct 18, 2014, at 9:52 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
After successful
toBreeze is private within Spark, so it should not be accessible to users. If you
want to make a Breeze vector from an MLlib one, it's pretty straightforward,
and you can make your own utility function for it.
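For instance, a small user-side helper along those lines (assuming Breeze is on your classpath, as it is when you depend on MLlib):

import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

def toBreeze(v: Vector): BV[Double] = v match {
  case d: DenseVector  => new BDV[Double](d.values)
  case s: SparseVector => new BSV[Double](s.indices, s.values, s.size)
}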
Matei
On Oct 17, 2014, at 5:09 PM, Sean Owen so...@cloudera.com wrote:
Yes, I
After successful events in the past two years, the Spark Summit conference has
expanded for 2015, offering both an event in New York on March 18-19 and one in
San Francisco on June 15-17. The conference is a great chance to meet people
from throughout the Spark community and see the latest
of issues. Thanks in advance!
On Oct 10, 2014 10:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hi folks,
I interrupt your regularly scheduled user / dev list to bring you some pretty
cool news for the project, which is that we've been able to use Spark to
break MapReduce's 100 TB
Very cool Denny, thanks for sharing this!
Matei
On Oct 11, 2014, at 9:46 AM, Denny Lee denny.g@gmail.com wrote:
https://www.concur.com/blog/en-us/connect-tableau-to-sparksql
If you're wondering how to connect Tableau to SparkSQL - here are the steps
to connect Tableau to SparkSQL.
Hi folks,
I interrupt your regularly scheduled user / dev list to bring you some pretty
cool news for the project, which is that we've been able to use Spark to break
MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x fewer
nodes. There's a detailed writeup at
Added you, thanks! (You may have to shift-refresh the page to see it updated).
Matei
On Oct 10, 2014, at 1:52 PM, Michael Oczkowski michael.oczkow...@seeq.com
wrote:
Please add the Boulder-Denver Spark meetup group to the list on the website.
A SchemaRDD is still an RDD, so you can just do rdd.map(row => row.toString).
Or if you want to get a particular field of the row, you can do rdd.map(row =>
row(3).toString).
Matei
On Oct 9, 2014, at 1:22 PM, Soumya Simanta soumya.sima...@gmail.com wrote:
I've a SchemaRDD that I want to
I'm pretty sure inner joins on Spark SQL already build only one of the sides.
Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer
joins do both, and it seems like we could optimize it for those that are not
full.
Matei
On Oct 7, 2014, at 11:04 PM, Haopu Wang
The issue is that you're using SQLContext instead of HiveContext. SQLContext
implements a smaller subset of the SQL language and so you're getting a SQL
parse error because it doesn't support the syntax you have. Look at how you'd
write this in HiveQL, and then try doing that with HiveContext.
Pretty cool, thanks for sharing this! I've added a link to it on the wiki:
https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects.
Matei
On Oct 1, 2014, at 1:41 PM, Koert Kuipers ko...@tresata.com wrote:
well, sort of! we make input/output formats (cascading taps,
It should just work in PySpark, the same way it does in Java / Scala apps.
Matei
On Oct 1, 2014, at 4:12 PM, Sungwook Yoon sy...@maprtech.com wrote:
Yes.. you should use maprfs://
I personally haven't used pyspark, I just used scala shell or standalone with
MapR.
I think you need to
You need to set --total-executor-cores to limit how many total cores it grabs
on the cluster. --executor-cores is just for each individual executor, but it
will try to launch many of them.
Matei
On Oct 1, 2014, at 4:29 PM, Sanjay Subramanian
sanjaysubraman...@yahoo.com.INVALID wrote:
hey
Is your file managed by Hive (and thus present in a Hive metastore)? In that
case, Spark SQL
(https://spark.apache.org/docs/latest/sql-programming-guide.html) is the
easiest way.
Matei
On September 23, 2014 at 2:26:10 PM, Pramod Biligiri (pramodbilig...@gmail.com)
wrote:
Hi,
I'm trying to
File takes a filename to write to, while Dataset takes only a JobConf. This
means that Dataset is more general (it can also save to storage systems that
are not file systems, such as key-value stores), but is more annoying to use if
you actually have a file.
Matei
On September 21, 2014 at
Hey Dave, try out RDD.toLocalIterator -- it gives you an iterator that reads
one RDD partition at a time. Scala iterators also have methods like grouped()
that let you get fixed-size groups.
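For example (the group size is arbitrary, and rdd stands in for your RDD):

// Pulls the RDD back one partition at a time, then processes fixed-size groups on the driver.
rdd.toLocalIterator.grouped(500).foreach { batch =>
  batch.foreach(println)
}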
Matei
On September 18, 2014 at 7:58:34 PM, dave-anderson (david.ander...@pobox.com)
wrote:
I have an
I'm pretty sure it does help, though I don't have any numbers for it. In any
case, Spark will automatically benefit from this if you link it to a version of
HDFS that contains this.
Matei
On September 17, 2014 at 5:15:47 AM, Gary Malouf (malouf.g...@gmail.com) wrote:
Cloudera had a blog post
If you want to run the computation on just one machine (using Spark's local
mode), it can probably run in a container. Otherwise you can create a
SparkContext there and connect it to a cluster outside. Note that I haven't
tried this though, so the security policies of the container might be too
Scala 2.11 work is under way in open pull requests though, so hopefully it will
be in soon.
Matei
On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) wrote:
ah...thanks!
On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra m...@clearstorydata.com wrote:
No, not yet. Spark SQL is
at the earliest.
On Mon, Sep 15, 2014 at 12:11 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Scala 2.11 work is under way in open pull requests though, so hopefully it will
be in soon.
Matei
On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) wrote:
ah...thanks!
On Mon, Sep 15