Are you using the logistic_regression.py in examples/src/main/python or
examples/src/main/python/mllib? The first one is an example of writing logistic
regression by hand and won’t be as efficient as the MLlib one. I suggest trying
the MLlib one.
You may also want to check how many iterations ...
The MLlib version of logistic regression doesn't seem to use all the cores on
my machine.
Regards,
Krishna
On Wed, Jun 4, 2014 at 6:47 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Are you using the logistic_regression.py in examples/src/main/python or
examples/src/main
in Java and port
it into Spark?
Best regards,
Wei
-
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan
From: Matei Zaharia matei.zaha...@gmail.com
To: user
..., June 5, 2014 1:35 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
If this isn’t the problem, it would be great if you can post the code for the
program.
Matei
On Jun 4, 2014, at 12:58 PM, Xu (Simon) Chen xche...@gmail.com wrote:
Maybe your two workers have different assembly jar
If this is a useful feature for local mode, we should open a JIRA to document
the setting or improve it (I’d prefer to add a spark.local.retries property
instead of a special URL format). We initially disabled it for everything
except unit tests because 90% of the time an exception in local
You currently can’t have multiple SparkContext objects in the same JVM, but
within a SparkContext, all of the APIs are thread-safe so you can share that
context between multiple threads. The other issue you’ll run into is that in
each thread where you want to use Spark, you need to use
... If we can
disable the UI HTTP server, it would be much simpler to handle than having
two HTTP containers to deal with.
Chester
On Monday, June 9, 2014 4:35 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
You currently can’t have multiple SparkContext objects in the same JVM
It might be that conf/spark-env.sh on EC2 is configured to set it to 512, and
is overriding the application’s settings. Take a look in there and delete that
line if possible.
Matei
On Jun 10, 2014, at 2:38 PM, Aliaksei Litouka aliaksei.lito...@gmail.com
wrote:
I am testing my application in
Alright, added you.
Matei
On Jun 11, 2014, at 1:28 PM, Derek Mansen de...@vistarmedia.com wrote:
Hello, I was wondering if we could add our organization to the Powered by
Spark page. The information is:
Name: Vistar Media
URL: www.vistarmedia.com
Description: Location technology company
combineByKey is designed for when your return type from the aggregation is
different from the values being aggregated (e.g. you group together objects),
and it should allow you to modify the leftmost argument of each function
(mergeCombiners, mergeValue, etc) and return that instead of
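For reference, a minimal sketch of a combineByKey call along these lines, assuming a spark-shell session where sc is available (the data and combiner type are illustrative, not from the thread):
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val grouped = pairs.combineByKey(
  (v: Int) => List(v),                     // createCombiner: start a new list for a key
  (acc: List[Int], v: Int) => v :: acc,    // mergeValue: fold a value into an existing list
  (a: List[Int], b: List[Int]) => a ::: b  // mergeCombiners: merge two partial lists
)
grouped.collect().foreach(println)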
and will post it if I find it :)
Thank you, anyway
On Wed, Jun 11, 2014 at 12:19 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
It might be that conf/spark-env.sh on EC2 is configured to set it to 512, and
is overriding the application’s settings. Take a look in there and delete
You need to factor your program so that it’s not just a main(). This is not a
Spark-specific issue, it’s about how you’d unit test any program in general. In
this case, your main() creates a SparkContext, so you can’t pass one from
outside, and your code has to read data from a file and write
The order is not guaranteed actually, only which keys end up in each partition.
Reducers may fetch data from map tasks in an arbitrary order, depending on
which ones are available first. If you’d like a specific order, you should sort
each partition. Here you might be getting it because each
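A minimal sketch of sorting within each partition as suggested above, assuming a spark-shell sc (the data is illustrative):
val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
val sortedWithinPartitions = pairs.mapPartitions(
  iter => iter.toArray.sortBy(_._1).iterator,  // sort only the elements of this partition
  preservesPartitioning = true)                // keys stay in the same partitions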
It’s true that it can’t. You can try to use the CloudPickle library instead,
which is what we use within PySpark to serialize functions (see
python/pyspark/cloudpickle.py). However I’m also curious, why do you need an
RDD of functions?
Matei
On Jun 15, 2014, at 4:49 PM, madeleine
is that I'm using alternating minimization, so I'll be
minimizing over the rows and columns of this matrix at alternating steps;
hence I need to store both the matrix and its transpose to avoid data
thrashing.
On Mon, Jun 16, 2014 at 11:05 AM, Matei Zaharia [via Apache Spark User List]
[hidden
There are a few options:
- Kryo might be able to serialize these objects out of the box, depending
what’s inside them. Try turning it on as described at
http://spark.apache.org/docs/latest/tuning.html.
- If that doesn’t work, you can create your own “wrapper” objects that
implement
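A minimal sketch of the first option above, turning Kryo on through the spark.serializer property in a standalone app (the app name is illustrative):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("KryoExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)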
Interesting, does anyone know the people over there who set it up? It would be
good if Apache itself could publish packages there, though I’m not sure what’s
involved. Since Spark just depends on Java and Python it should be easy for us
to update.
Matei
On Jun 18, 2014, at 1:37 PM, Nick
I was going to suggest the same thing :).
On Jun 18, 2014, at 4:56 PM, Evan R. Sparks evan.spa...@gmail.com wrote:
This looks like a job for SparkSQL!
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class MyRecord(country: String, name: String, age: Int,
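A fuller sketch of the same approach, assuming a spark-shell sc; the file path, parsing, and query are illustrative, and only the fields visible in the excerpt are kept:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class MyRecord(country: String, name: String, age: Int)
val records = sc.textFile("people.csv")
  .map(_.split(","))
  .map(p => MyRecord(p(0), p(1), p(2).trim.toInt))
records.registerTempTable("records")  // registerAsTable in Spark 1.0.x
val adults = sqlContext.sql("SELECT name FROM records WHERE age >= 18")
adults.collect().foreach(println)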
You should be able to use many of the MLlib Model objects directly in Storm, if
you save them out using Java serialization. The only one that won’t work is
probably ALS, because it’s a distributed model.
Otherwise, you will have to output them in your own format and write code for
evaluating
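A sketch of the Java-serialization route for a single-machine model; the model type and output path are illustrative:
import java.io.{FileOutputStream, ObjectOutputStream}
import org.apache.spark.mllib.classification.LogisticRegressionModel

def saveModel(model: LogisticRegressionModel, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(model) finally out.close()  // plain Java serialization of the model object
}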
customer targeting, accurate inventory and efficient analysis.
Thanks!
Best Regards,
Sonal
Nube Technologies
On Thu, Jun 12, 2014 at 11:33 PM, Derek Mansen de...@vistarmedia.com wrote:
Awesome, thank you!
On Wed, Jun 11, 2014 at 6:53 PM, Matei Zaharia matei.zaha
I’d suggest asking the IBM Hadoop folks, but my guess is that the library
cannot be found in /opt/IHC/lib/native/Linux-amd64-64/. Or maybe if this
exception is happening in your driver program, the driver program’s
java.library.path doesn’t include this. (SPARK_LIBRARY_PATH from spark-env.sh
Try looking at the running processes with “ps” to see their full command line
and see whether any options are different. It seems like in both cases, your
young generation is quite large (11 GB), which doesn’t make a lot of sense with a
heap of 15 GB. But maybe I’m misreading something.
Matei
Spark SQL in Spark 1.1 will include all the functionality in Shark; take a look
at
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html.
We decided to do this because at the end of the day, the only code left in
Shark was the JDBC / Thrift
When you use hadoopConfiguration directly, I don’t think you have to replace
the “/” with “%2f”. Have you tried it without that? Also make sure you’re not
replacing slashes in the URL itself.
Matei
On Jul 2, 2014, at 4:17 PM, Brian Gawalt bgaw...@gmail.com wrote:
Hello everyone,
I'm
They are for CDH4 without YARN, since YARN is experimental in that. You can
download one of the Hadoop 2 packages if you want to run on YARN. Or you might
have to build specifically against CDH4's version of YARN if that doesn't work.
Matei
On Jul 7, 2014, at 9:37 PM, ch huang
Thanks for catching this. For now you can just access the page through http://
instead of https:// to avoid this.
Matei
On Jul 8, 2014, at 10:46 PM, binbinbin915 binbinbin...@live.cn wrote:
https://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression
on Chrome 35 with
The UI code is the same in both, but one possibility is that your executors
were given less memory on YARN. Can you check that? Or otherwise, how do you
know that some RDDs were cached?
Matei
On Jul 12, 2014, at 4:12 PM, Koert Kuipers ko...@tresata.com wrote:
hey shuo,
so far all stage
You currently can't use SparkContext inside a Spark task, so in this case you'd
have to call some kind of local K-means library. One example you can try to use
is Weka (http://www.cs.waikato.ac.nz/ml/weka/). You can then load your text
files as an RDD of strings with SparkContext.wholeTextFiles
Are you increasing the number of parallel tasks with cores as well? With more
tasks there will be more data communicated and hence more calls to these
functions.
Unfortunately contention is kind of hard to measure, since often the result is
that you see many cores idle as they're waiting on a
I think coalesce with shuffle=true will force it to have one task per node.
Without that, it might be that due to data locality it decides to launch
multiple ones on the same node even though the total # of tasks is equal to the
# of nodes.
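A minimal sketch of that coalesce call, assuming a spark-shell sc (the input path and node count are illustrative):
val rdd = sc.textFile("hdfs:///data/input")
val numNodes = 16
val onePerNode = rdd.coalesce(numNodes, shuffle = true)  // shuffle=true forces exactly numNodes partitions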
If this is the *only* thing you run on the cluster,
Yeah, I'd just add a spark-util that has these things.
Matei
On Jul 14, 2014, at 1:04 PM, Michael Armbrust mich...@databricks.com wrote:
Yeah, sadly this dependency was introduced when someone consolidated the
logging infrastructure. However, the dependency should be very small and
thus
The script should be there, in the spark/bin directory. What command did you
use to launch the cluster?
Matei
On Jul 14, 2014, at 1:12 PM, Josh Happoldt josh.happo...@trueffect.com wrote:
Hi All,
I've used the spark-ec2 scripts to build a simple 1.0.1 Standalone cluster on
EC2. It
of them
here, but if your file is big it will also have at least one task per 32 MB
block of the file.
Matei
On Jul 14, 2014, at 6:39 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
I see, so here might be the problem. With more cores, there's less memory
available per core, and now many of your
You can change this setting through SparkContext.hadoopConfiguration, or put
the conf/ directory of your Hadoop installation on the CLASSPATH when you
launch your app so that it reads the config values from there.
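A one-line sketch of the programmatic route, assuming a spark-shell sc (the property and value are illustrative):
sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://namenode:8020")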
Matei
On Jul 14, 2014, at 8:06 PM, valgrind_girl 124411...@qq.com wrote:
eager
Hi Nathan,
I think there are two possible reasons for this. One is that even though you
are caching RDDs, their lineage chain gets longer and longer, and thus
serializing each RDD takes more time. You can cut off the chain by using
RDD.checkpoint() periodically, say every 5-10 iterations. The
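A sketch of periodic checkpointing in an iterative job, assuming a spark-shell sc; the checkpoint directory, iteration count, and update step are all illustrative:
sc.setCheckpointDir("hdfs:///tmp/checkpoints")
var data = sc.parallelize(1 to 1000000).map(_.toDouble)
for (i <- 1 to 50) {
  data = data.map(_ * 0.99).cache()
  if (i % 10 == 0) data.checkpoint()  // cut off the lineage chain every 10 iterations
}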
Yeah, this is handled by the commit call of the FileOutputFormat. In general
Hadoop OutputFormats have a concept called committing the output, which you
should do only once per partition. In the file ones it does an atomic rename to
make sure that the final output is a complete file.
Matei
On
Yeah, we try to have a regular 3 month release cycle; see
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage for the current
window.
Matei
On Jul 16, 2014, at 4:21 PM, Mark Hamstra m...@clearstorydata.com wrote:
You should expect master to compile and run: patches aren't merged
Hi Kamal,
This is not what preservesPartitioning does -- actually what it means is that
if the RDD has a Partitioner set (which means it's an RDD of key-value pairs
and the keys are grouped in a known way, e.g. hashed or range-partitioned),
your map function is not changing the partition of
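A small sketch of the distinction being described, assuming a spark-shell sc (the data is illustrative):
import org.apache.spark.HashPartitioner
val byKey = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(new HashPartitioner(4))
val kept = byKey.mapValues(_ + 1)                      // cannot change keys, so the partitioner is preserved
val dropped = byKey.map { case (k, v) => (k, v + 1) }  // could change keys, so the partitioner is dropped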
It's possible using the --queue argument of spark-submit. Unfortunately this is
not documented on http://spark.apache.org/docs/latest/running-on-yarn.html but
it appears if you just type spark-submit --help or spark-submit with no
arguments.
Matei
On Jul 17, 2014, at 2:33 AM, Konstantin
Good question.. I'll ask INFRA because I haven't seen other Apache mailing
lists provide this. It would indeed be helpful.
Matei
On Jul 17, 2014, at 12:59 PM, Nick Chammas nicholas.cham...@gmail.com wrote:
Can we modify the mailing list to include permalinks to the thread in the
footer of
Is this with the 1.0.0 scripts? I believe it's fixed in 1.0.1.
Matei
On Jul 20, 2014, at 1:22 AM, Chris DuBois chris.dub...@gmail.com wrote:
Using the spark-ec2 script with m3.2xlarge instances seems to not have /mnt
and /mnt2 pointing to the 80gb SSDs that come with that instance. Does
Actually the script in the master branch is also broken (it's pointing to an
older AMI). Try 1.0.1 for launching clusters.
On Jul 20, 2014, at 2:25 PM, Chris DuBois chris.dub...@gmail.com wrote:
I pulled the latest last night. I'm on commit 4da01e3.
On Sun, Jul 20, 2014 at 2:08 PM, Matei
Is the first() being computed locally on the driver program? Maybe it's too hard
to compute with the memory, etc. available there. Take a look at the driver's
log and see whether it has the message "Computing the requested partition
locally".
Matei
On Jul 22, 2014, at 12:04 PM, Nathan Kronenfeld
This is being tracked here: https://issues.apache.org/jira/browse/SPARK-1812,
since it will also be needed for cross-building with Scala 2.11. Maybe we can
do it before that. Probably too late for 1.1, but you should open an issue for
1.2.
In that JIRA I linked, there's a pull request from a
The Pair ones return a JavaPairRDD, which has additional operations on
key-value pairs. Take a look at
http://spark.apache.org/docs/latest/programming-guide.html#working-with-key-value-pairs
for details.
Matei
On Jul 24, 2014, at 3:41 PM, abhiguruvayya sharath.abhis...@gmail.com wrote:
Can
These messages are actually not about spilling the RDD, they're about spilling
intermediate state in a reduceByKey, groupBy or other operation whose state
doesn't fit in memory. We have to do that in these cases to avoid going out of
memory. You can minimize spilling by having more reduce tasks
Even in local mode, Spark serializes data that would be sent across the
network, e.g. in a reduce operation, so that you can catch errors that would
happen in distributed mode. You can make serialization much faster by using the
Kryo serializer; see
These numbers are from GPUs and Intel MKL (a closed-source math library for
Intel processors), where for CPU-bound algorithms you are going to get faster
speeds than MLlib's JBLAS. However, there's in theory nothing preventing the
use of these in MLlib (e.g. if you have a faster BLAS locally;
Hi Martin,
Job ads are actually not allowed on the list, but thanks for asking. Just
posting this for others' future reference.
Matei
On July 29, 2014 at 8:34:59 AM, Martin Goodson (mar...@skimlinks.com) wrote:
I'm not sure if job adverts are allowed on here - please let me know if not.
Java is very close to Scala across the board; the only thing missing in it
right now is GraphX (which is still alpha). Python is missing GraphX, streaming
and a few of the ML algorithms, though most of them are there. So it should be
fine to start with any of them. See
This should be okay, but make sure that your cluster also has the right code
deployed. Maybe you have the wrong one.
If you built Spark from source multiple times, you may also want to try sbt
clean before sbt assembly.
Matei
On August 1, 2014 at 12:00:07 PM, SK (skrishna...@gmail.com) wrote:
This looks pretty comprehensive to me. A few quick suggestions:
- On the VM part: we've actually been avoiding this in all the Databricks
training efforts because the VM itself can be annoying to install and it makes
it harder for people to really use Spark for development (they can learn it,
To get the ClassTag object inside your function with the original syntax you
used (T: ClassTag), you can do this:
def read[T: ClassTag](): T = {
  val ct = classTag[T]  // requires: import scala.reflect.classTag
  ct.runtimeClass.newInstance().asInstanceOf[T]
}
Passing the ClassTag with : ClassTag lets you have an implicit parameter that
Emails sent from Nabble have it, while others don't. Unfortunately I haven't
received a reply from ASF infra on this yet.
Matei
On August 5, 2014 at 2:04:10 PM, Nicholas Chammas (nicholas.cham...@gmail.com)
wrote:
Looks like this feature has been turned off. Are these changes intentional? Or
Oh actually sorry, it looks like infra has looked at it but they can't add
permalinks. They can only add "here's how to unsubscribe" footers. My bad, I
just didn't catch the email update from them.
Matei
On August 5, 2014 at 2:39:45 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote:
Emails sent
Your map-only job should not be shuffling, but if you want to see what's
running, look at the web UI at http://driver:4040. In fact the job should not
even write stuff to disk except inasmuch as the Hadoop S3 library might build
up blocks locally before sending them on.
My guess is that it's
Good question; I don't know of one but I believe people at Cloudera had some
thoughts of porting Sqoop to Spark in the future, and maybe they'd consider
DistCP as part of this effort. I agree it's missing right now.
Matei
On August 12, 2014 at 11:04:28 AM, Gary Malouf (malouf.g...@gmail.com)
What is your Spark executor memory set to? (You can see it in Spark's web UI at
http://driver:4040 under the executors tab). One thing to be aware of is that
the JVM never really releases memory back to the OS, so it will keep filling up
to the maximum heap size you set. Maybe 4 executors with
Thanks for sharing this, Brandon! Looks like a great architecture for people to
build on.
Matei
On August 15, 2014 at 2:07:06 PM, Brandon Amos (a...@adobe.com) wrote:
Hi Spark community,
At Adobe Research, we're happy to open source a prototype
technology called Spindle we've been
TypeTags are unfortunately not thread-safe in Scala 2.10. They were still
somewhat experimental at the time so we decided not to use them. If you want
though, you can probably design other APIs that pass a TypeTag around (e.g.
make a method that takes an RDD[T] but also requires an implicit
You should be able to just download / unzip a Spark release and run it on a
Windows machine with the provided .cmd scripts, such as bin\spark-shell.cmd.
The scripts to launch a standalone cluster (e.g. start-all.sh) won't work on
Windows, but you can launch a standalone cluster manually using
Have you tried the pipe() operator? It should work if you can launch your
script from the command line. Just watch out for any environment variables
needed (you can pass them to pipe() as an optional argument if there are some).
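A sketch of a pipe() call with an environment variable, assuming a spark-shell sc; the script path and variable are illustrative:
val input = sc.textFile("hdfs:///data/records")
val scored = input.pipe(Seq("/path/to/score.sh"), Map("MODEL_DIR" -> "/opt/models"))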
On August 25, 2014 at 12:41:29 AM, Jaonary Rabarisoa
It seems to be because you went there with https:// instead of http://. That
said, we'll fix it so that it works on both protocols.
Matei
On August 25, 2014 at 1:56:16 PM, Nick Chammas (nicholas.cham...@gmail.com)
wrote:
https://spark.apache.org/screencasts/1-first-steps-with-spark.html
The
It should be fixed now. Maybe you have a cached version of the page in your
browser. Open DevTools (Cmd-Shift-I), press the gear icon, check "Disable cache
(while DevTools is open)", then reload the page to bypass the cache.
Matei
On August 26, 2014 at 7:31:18 AM, Nicholas Chammas
Is this a standalone mode cluster? We don't currently make this guarantee,
though it will likely work in 1.0.0 to 1.0.2. The problem though is that the
standalone mode grabs the executors' version of Spark code from what's
installed on the cluster, while your driver might be built against
You can use sc.wholeTextFiles to read each file as a complete String, though it
requires each file to be small enough for one task to process.
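A minimal sketch of wholeTextFiles, assuming a spark-shell sc (the directory is illustrative):
val files = sc.wholeTextFiles("hdfs:///data/docs")      // one (path, fullContents) pair per file
val lineCounts = files.mapValues(_.split("\n").length)  // each whole file is handled by a single task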
On August 26, 2014 at 4:01:45 PM, Chris Fregly (ch...@fregly.com) wrote:
i've seen this done using mapPartitions() where each partition represents a
You should try to find a Java-based library, then you can call it from Scala.
Matei
On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote:
Hi I am trying to find a CUDA library in Scala, to see if some matrix
manipulation in MLlib can be sped up.
I googled a few but found no
connect to an existing 1.0.0 cluster and see
what happens...
Thanks, Matei :)
On Tue, Aug 26, 2014 at 6:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Is this a standalone mode cluster? We don't currently make this guarantee,
though it will likely work in 1.0.0 to 1.0.2. The problem
I think this will increasingly be its role, though it doesn't make sense to move
it into core because it is clearly just a client of the core APIs. What usage do
you have in mind in particular? It would be nice to know how the non-SQL APIs
for this could be better.
Matei
On August 27, 2014 at
You can use spark-shell -i file.scala to run that. However, that keeps the
interpreter open at the end, so you need to make your file end with
System.exit(0) (or even more robustly, do stuff in a try {} and add that in
finally {}).
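A sketch of such a script for spark-shell -i file.scala, using the try/finally pattern just described (the paths are illustrative):
try {
  val counts = sc.textFile("hdfs:///data/input")
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.saveAsTextFile("hdfs:///data/output")
} finally {
  System.exit(0)  // make the interpreter exit once the script finishes
}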
In general it would be better to compile apps and run them
Awesome to hear this, Mayur! Thanks for putting this together.
Matei
On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.rust...@gmail.com)
wrote:
Hi,
We have migrated Pig functionality on top of Spark passing 100% e2e for success
cases in pig test suite. That means UDF, Joins other
Yes, executors run one task per core of your machine by default. You can also
manually launch them with more worker threads than you have cores. What cluster
manager are you on?
Matei
On August 29, 2014 at 11:24:33 AM, Victor Tso-Guillen (v...@paxata.com) wrote:
I'm thinking of local mode
In 1.1, you'll be able to get all of these properties using sortByKey, and then
mapPartitions on top to iterate through the key-value pairs. Unfortunately
sortByKey does not let you control the Partitioner, but it's fairly easy to
write your own version that does if this is important.
In
... does it apply to both sides of the join, or only one
(while the other side is streaming)?
On Sat, Aug 30, 2014 at 1:30 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
In 1.1, you'll be able to get all of these properties using sortByKey, and then
mapPartitions on top to iterate through the key
, Steve Lewis (lordjoe2...@gmail.com) wrote:
Is there a sample of how to do this -
I see 1.1 is out but cannot find samples of mapPartitions
A Java sample would be very useful
On Sat, Aug 30, 2014 at 10:30 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
In 1.1, you'll be able to get all
Hi Mohit,
This looks pretty interesting, but just a note on the implementation -- it
might be worthwhile to try doing this on top of Spark SQL SchemaRDDs. The
reason is that SchemaRDDs already have an efficient in-memory representation
(columnar storage), and can be read from a variety of data
and what do I expect to see in switching from one partition to
another as the code runs?
On Sat, Aug 30, 2014 at 10:30 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
In 1.1, you'll be able to get all of these properties using sortByKey, and then
mapPartitions on top to iterate through the key-value
BTW you can also use rdd.partitions() to get a list of Partition objects and
see how many there are.
On September 4, 2014 at 5:18:30 PM, Matei Zaharia (matei.zaha...@gmail.com)
wrote:
Partitioners also work in local mode, the only question is how to see which
data fell into each partition
Thanks to everyone who contributed to implementing and testing this release!
Matei
On September 11, 2014 at 11:52:43 PM, Tim Smith (secs...@gmail.com) wrote:
Thanks for all the good work. Very excited about seeing more features and
better stability in the framework.
On Thu, Sep 11, 2014 at
Hi Du,
I don't think NullWritable has ever been serializable, so you must be doing
something differently from your previous program. In this case though, just use
a map() to turn your Writables to serializable types (e.g. null and String).
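A sketch of that map() workaround, assuming a spark-shell sc (the input path is illustrative):
import org.apache.hadoop.io.{NullWritable, Text}
val raw = sc.sequenceFile("hdfs:///data/seq", classOf[NullWritable], classOf[Text])
val plain = raw.map { case (_, v) => (null, v.toString) }  // only serializable types from here on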
Matei
On September 12, 2014 at 8:48:36 PM, Du Li
I've seen the file name too long error when compiling on an encrypted Linux
file system -- some of them have a limit on file name lengths. If you're on
Linux, can you try compiling inside /tmp instead?
Matei
On September 13, 2014 at 10:03:14 PM, Yin Huai (huaiyin@gmail.com) wrote:
Can you
Scala 2.11 work is under way in open pull requests though, so hopefully it will
be in soon.
Matei
On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) wrote:
ah...thanks!
On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra m...@clearstorydata.com wrote:
No, not yet. Spark SQL is
at the earliest.
On Mon, Sep 15, 2014 at 12:11 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Scala 2.11 work is under way in open pull requests though, so hopefully it will
be in soon.
Matei
On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) wrote:
ah...thanks!
On Mon, Sep 15
It's true that it does not send a kill command right now -- we should probably
add that. This code was written before tasks were killable AFAIK. However, the
*job* should still finish while a speculative task is running as far as I know,
and it will just leave that task behind.
Matei
On
.count(). As you can
see, count() does not need to serialize and ship data while the other three
methods do.
Do you recall any difference between spark 1.0 and 1.1 that might cause this
problem?
Thanks,
Du
From: Matei Zaharia matei.zaha...@gmail.com
Date: Friday, September 12, 2014 at 9:10 PM
sortByKey is indeed O(n log n): it does a first pass to figure out even-sized
partitions (by sampling the RDD), then a second pass to do a distributed
merge-sort (first partition the data on each machine, then run a reduce phase
that merges the data for each partition). The point where it becomes
If you want to run the computation on just one machine (using Spark's local
mode), it can probably run in a container. Otherwise you can create a
SparkContext there and connect it to a cluster outside. Note that I haven't
tried this though, so the security policies of the container might be too
I'm pretty sure it does help, though I don't have any numbers for it. In any
case, Spark will automatically benefit from this if you link it to a version of
HDFS that contains this.
Matei
On September 17, 2014 at 5:15:47 AM, Gary Malouf (malouf.g...@gmail.com) wrote:
Cloudera had a blog post
Hey Dave, try out RDD.toLocalIterator -- it gives you an iterator that reads
one RDD partition at a time. Scala iterators also have methods like grouped()
that let you get fixed-size groups.
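A minimal sketch of that pattern, assuming a spark-shell sc (the batch size is illustrative):
val results = sc.parallelize(1 to 100000)
results.toLocalIterator.grouped(1000).foreach { batch =>  // pulls one partition at a time to the driver
  println(s"processing ${batch.size} records")
}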
Matei
On September 18, 2014 at 7:58:34 PM, dave-anderson (david.ander...@pobox.com)
wrote:
I have an
File takes a filename to write to, while Dataset takes only a JobConf. This
means that Dataset is more general (it can also save to storage systems that
are not file systems, such as key-value stores), but is more annoying to use if
you actually have a file.
Matei
On September 21, 2014 at
Is your file managed by Hive (and thus present in a Hive metastore)? In that
case, Spark SQL
(https://spark.apache.org/docs/latest/sql-programming-guide.html) is the
easiest way.
Matei
On September 23, 2014 at 2:26:10 PM, Pramod Biligiri (pramodbilig...@gmail.com)
wrote:
Hi,
I'm trying to
Pretty cool, thanks for sharing this! I've added a link to it on the wiki:
https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects.
Matei
On Oct 1, 2014, at 1:41 PM, Koert Kuipers ko...@tresata.com wrote:
well, sort of! we make input/output formats (cascading taps,
It should just work in PySpark, the same way it does in Java / Scala apps.
Matei
On Oct 1, 2014, at 4:12 PM, Sungwook Yoon sy...@maprtech.com wrote:
Yes.. you should use maprfs://
I personally haven't used pyspark, I just used scala shell or standalone with
MapR.
I think you need to
You need to set --total-executor-cores to limit how many total cores it grabs
on the cluster. --executor-cores is just for each individual executor, but it
will try to launch many of them.
Matei
On Oct 1, 2014, at 4:29 PM, Sanjay Subramanian
sanjaysubraman...@yahoo.com.INVALID wrote:
hey
The issue is that you're using SQLContext instead of HiveContext. SQLContext
implements a smaller subset of the SQL language and so you're getting a SQL
parse error because it doesn't support the syntax you have. Look at how you'd
write this in HiveQL, and then try doing that with HiveContext.
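A minimal sketch of switching to HiveContext, assuming a spark-shell sc (the query is illustrative):
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val counts = hiveContext.sql("SELECT key, count(*) FROM src GROUP BY key")  // hql(...) in Spark 1.0.x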
I'm pretty sure inner joins on Spark SQL already build only one of the sides.
Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer
joins do both, and it seems like we could optimize it for those that are not
full.
Matei
On Oct 7, 2014, at 11:04 PM, Haopu Wang
A SchemaRDD is still an RDD, so you can just do rdd.map(row => row.toString).
Or if you want to get a particular field of the row, you can do
rdd.map(row => row(3).toString).
Matei
On Oct 9, 2014, at 1:22 PM, Soumya Simanta soumya.sima...@gmail.com wrote:
I've a SchemaRDD that I want to
Hi folks,
I interrupt your regularly scheduled user / dev list to bring you some pretty
cool news for the project, which is that we've been able to use Spark to break
MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x fewer
nodes. There's a detailed writeup at
Added you, thanks! (You may have to shift-refresh the page to see it updated).
Matei
On Oct 10, 2014, at 1:52 PM, Michael Oczkowski michael.oczkow...@seeq.com
wrote:
Please add the Boulder-Denver Spark meetup group to the list on the website.
Very cool Denny, thanks for sharing this!
Matei
On Oct 11, 2014, at 9:46 AM, Denny Lee denny.g@gmail.com wrote:
https://www.concur.com/blog/en-us/connect-tableau-to-sparksql
If you're wondering how to connect Tableau to SparkSQL - here are the steps
to connect Tableau to SparkSQL.