At this time, the JDBC Data source is not extensible so it cannot support
SQL Server. There were some thoughts - credit to Cheng Lian for this -
about making the JDBC data source extensible for third-party support,
possibly via Slick.
On Mon, Apr 6, 2015 at 10:41 PM bipin bipin@gmail.com
Hi,
In Spark, there are two settings regarding the number of cores. One is at the task
level: spark.task.cpus,
and there is another one, which drives the number of cores per executor:
spark.executor.cores
Apart from using more than one core for a task which has to call some other
external API etc, is there
My Spark streaming application processes the data received in each interval.
In the Spark Stages UI, all the stages point to a single line of code,
windowDStream.foreachRDD, only (not the actions inside the DStream)
- Following is the information from Spark Stages UI page:
Stage Id
Let's say I follow the approach below and I get an RddPair so huge that it
cannot fit on one machine... what happens if I run foreach on this RDD?
On 7 April 2015 at 04:25, Jeetendra Gangele gangele...@gmail.com wrote:
On 7 April 2015 at 04:03, Dean Wampler deanwamp...@gmail.com wrote:
On Mon,
Thanks for the information. Hopefully this will happen in the near future. For
now my best bet would be to export the data and import it into Spark SQL.
On 7 April 2015 at 11:28, Denny Lee denny.g@gmail.com wrote:
At this time, the JDBC Data source is not extensible so it cannot support
SQL Server.
Thanks for the information Andy. I will go through the versions mentioned in
Dependencies.scala to identify the compatibility.
Regards,
Manish
From: andy petrella [mailto:andy.petre...@gmail.com]
Sent: Tuesday, April 07, 2015 11:04 AM
To: Manish Gupta 8; user@spark.apache.org
Subject: Re:
If I try to build spark-notebook with spark.version=1.2.0-cdh5.3.0, sbt
throws these warnings before failing to compile:
:: org.apache.spark#spark-yarn_2.10;1.2.0-cdh5.3.0: not found
:: org.apache.spark#spark-repl_2.10;1.2.0-cdh5.3.0: not found
Any suggestions?
Thanks
From: Manish Gupta 8
Mmmh, you want it running Spark 1.2 with Hadoop 2.5.0-cdh5.3.2, right?
If I'm not wrong, you might have to launch it like so:
```
sbt -Dspark.version=1.2.0 -Dhadoop.version=2.5.0-cdh5.3.2
```
Or you can download it from http://spark-notebook.io if you want.
HTH
andy
On Tue, Apr 7, 2015 at
Hi,
One of the rationales behind killing the app can be to avoid skewness in
the data.
I have created this issue (https://issues.apache.org/jira/browse/SPARK-6735)
to provide options for disabling this behaviour, as well as making the
number of executor failures relative with respect to a
Hi all:
I am using Spark Streaming (1.3.1) as a long-running service and it runs out
of memory after running for 7 days.
I found that the field authorizedCommittersByStage in the
OutputCommitCoordinator class causes the OOM.
authorizedCommittersByStage is a map, key is StageId, value is
Hi, Experts
I run my Spark cluster on YARN. I used to get executors' logs from Spark's
History Server. But after I started my Hadoop job history server and
configured aggregation of Hadoop job logs to an HDFS directory, I found
that I could not get Spark's executors' logs any more. Is
On 6 Apr 2015, at 23:05, Patrick Young
patrick.mckendree.yo...@gmail.com
wrote:
does anyone have any thoughts on storing a really large raster in HDFS? It seems
like if I just dump the image into HDFS as is, it'll get stored in blocks all
across the
Or you could build an uber jar ( you could google that )
https://eradiating.wordpress.com/2015/02/15/getting-spark-streaming-on-kafka-to-work/
--- Original Message ---
From: Akhil Das ak...@sigmoidanalytics.com
Sent: April 4, 2015 11:52 PM
To: Priya Ch learnings.chitt...@gmail.com
Cc:
Hello,
The old GraphX API mapReduceTriplets has an optional parameter
activeSetOpt: Option[(VertexRDD[_], EdgeDirection)] that limits the input of sendMessage.
However, in the new API aggregateMessages I could not find this option;
why is it not offered any more?
Alcaid
Heya,
You might be interested in looking at GeoTrellis.
They use RDDs of Tiles to process big images like Landsat ones (especially
Landsat 8).
However, I see you have only 1G per file, so I guess you only care about a
single band? Or is it a reboxed pic?
Note: I think the GeoTrellis image format is
In my dev-test env I have 3 virtual machines; every machine has 12G
memory and 8 CPU cores.
Here are spark-defaults.conf and spark-env.sh. Maybe some config is not right.
I run this command: spark-submit --master yarn-client --driver-memory 7g
--executor-memory 6g /home/hadoop/spark/main.py
Hi all,
I've already opened a bug on Jira some days ago [1] but I'm starting
thinking this is not the correct way to go since I haven't got any news
about it yet.
Let me try to explain it briefly: with pyspark, trying to cogroup two input
files with different schemas leads (nondeterministically)
Using Spark (1.2) Streaming to read Avro-schema-based topics flowing in Kafka,
and then using the Spark SQL context to register the data as a temp table. The Avro Maven
plugin (version 1.7.7) generates the Java bean class for the Avro file, but it
includes a field named SCHEMA$ of type org.apache.avro.Schema which is
Hi ,
Is there any difference between textFile vs hadoopFile
(TextInputFormat) when data is present in HDFS? Will there be any performance
gain that can be observed?
Puneet Kumar Ojha
Data Architect | PubMatic (http://www.pubmatic.com/)
Hi,
I am building a pipeline and I've read most that I can find on the topic
(spark.ml library and the AMPcamp version of pipelines:
http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html).
I do not have structured data as in the case of the new Spark.ml library
which
There is no difference - textFile calls hadoopFile with a TextInputFormat, and
maps each value to a String.
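For reference, a minimal sketch of that equivalence (assuming an existing SparkContext sc and a made-up path; an illustration, not the exact internal code):

```
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val path = "hdfs:///data/input.txt"  // placeholder path

// sc.textFile(path) is essentially:
val lines = sc
  .hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map(pair => pair._2.toString)  // drop the byte-offset key, keep each line as a String
```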
—
Sent from Mailbox
On Tue, Apr 7, 2015 at 1:46 PM, Puneet Kumar Ojha
puneet.ku...@pubmatic.com wrote:
Hi ,
Is there any difference between textFile vs hadoopFile
Hello, guys!
I am a newbie to Spark and would appreciate any advice or help.
Here is the detailed question:
http://stackoverflow.com/questions/29493472/does-spark-utilize-the-sorted-order-of-hbase-keys-when-using-hbase-as-data-sour
Regards,
Yury
There are 500 million distinct users...
2015-04-07 17:45 GMT+03:00 Ted Yu yuzhih...@gmail.com:
How many distinct users are stored in HBase ?
TableInputFormat produces splits where number of splits matches the number
of regions in a table. You can write your own InputFormat which splits
Foreach() runs in parallel across the cluster, like map, flatMap, etc.
You'll only run into problems if you call collect(), which brings the
entire RDD into memory in the driver program.
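A small sketch of that distinction, assuming an existing SparkContext sc (names are made up):

```
val pairRdd = sc.parallelize(Seq("a" -> Seq(1, 2), "b" -> Seq(3)))

// foreach runs on the executors, like map/flatMap; nothing comes back to the driver.
pairRdd.foreach { case (key, values) =>
  println(s"$key -> ${values.size}")  // e.g. write to an external store instead
}

// collect() pulls the entire RDD into the driver's memory -- avoid it for huge RDDs.
// val everything = pairRdd.collect()
```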
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do
Hello,
First of all, thank you to everyone working on Spark. I've only been using
it for a few weeks now but so far I'm really enjoying it. You saved me from
a big, scary elephant! :-)
I was wondering if anyone might be able to offer some advice about working
with the Thrift JDBC server? I'm
How many distinct users are stored in HBase ?
TableInputFormat produces splits where number of splits matches the number
of regions in a table. You can write your own InputFormat which splits
according to user id.
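For context, a hedged sketch of the default region-per-split read (the table name is made up, and an existing SparkContext sc is assumed); a custom InputFormat that splits by user id would be plugged in where classOf[TableInputFormat] appears:

```
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "users")  // hypothetical table name

// One partition per HBase region by default.
val usersRdd = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
```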
FYI
On Tue, Apr 7, 2015 at 7:36 AM, Юра rvaniy@gmail.com wrote:
Hello,
That's correct, at this time MS SQL Server is not supported through the
JDBC data source. In my environment, we've been using Hadoop
streaming to extract out data from multiple SQL Servers, pushing the data
into HDFS, creating the Hive tables and/or converting them into Parquet,
and
Hi all,
I've already opened a bug on Jira some days ago [1] but I'm starting
thinking this is not the correct way to go since I haven't got any news
about it yet.
Let me try to explain it briefly: with pyspark, trying to cogroup two input
files with different schemas leads (nondeterministically)
I am having the same issue with my Java application.
String url = "jdbc:sqlserver://" + host + ":1433;DatabaseName=" +
database + ";integratedSecurity=true";
String driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver";
SparkConf conf = new
Sorry for the late reply.
I bypassed this by setting _JAVA_OPTIONS.
And here is ps aux | grep spark:
hadoop 14442 0.6 0.2 34334552 128560 pts/0 Sl+ 14:37 0:01
/usr/java/latest/bin/java org.apache.spark.deploy.SparkSubmitDriverBootstrapper
--driver-memory=5G --executor-memory=10G --master yarn-client
I have found the issue, but I think it is a bug.
If I change my class to:
public class ModelSessionBuilder implements Serializable {
/**
*
*/
.
private Properties[] propertiesList;
private
val locations = filelines.map(line => line.split("\t")).map(t =>
(t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct().collect()
val cartesienProduct = locations.cartesian(locations).map(t =>
Edge(t._1._1, t._2._1, distanceAmongPoints(t._1._2._1, t._1._2._2, t._2._2._1, t._2._2._2)))
Code executes
Then splitting according to user id's is out of the question :-)
On Tue, Apr 7, 2015 at 8:12 AM, Юра rvaniy@gmail.com wrote:
There are 500 million distinct users...
2015-04-07 17:45 GMT+03:00 Ted Yu yuzhih...@gmail.com:
How many distinct users are stored in HBase ?
TableInputFormat
BTW, just out of curiosity, I checked both the 1.3.0 release assembly
and the spark-core_2.10 artifact downloaded from
http://mvnrepository.com/, and neither contains any references to
anything under org.eclipse (all referenced jetty classes are the
shaded ones under org.spark-project.jetty).
On
Hi,
We have 2 Hive tables and want to join one with the other.
Initially, we ran a SQL request on HiveContext. But it did not work. It was
blocked at 30/600 tasks.
Then we tried to load the tables into two DataFrames; we encountered the
same problem.
Finally, it works with RDD.join. What we
I am thinking of following the approach below (in my class, HBase also returns the
same object which I will get in the RDD).
1. First run the flatMapToPair
JavaPairRDD<VendorRecord, Iterable<VendorRecord>> pairvendorData
=matchRdd.flatMapToPair( new PairFlatMapFunction<VendorRecord,
VendorRecord, VendorRecord>(){
Zsolt - what version of Java are you running?
On Mon, Mar 30, 2015 at 7:12 AM, Zsolt Tóth toth.zsolt@gmail.com
wrote:
Thanks for your answer!
I don't call .collect because I want to trigger the execution. I call it
because I need the rdd on the driver. This is not a huge RDD and it's not
Hi All, I am running the code below and it runs for a very long time; the
input to flatMapToPair is 50K records, and I am calling HBase 50K times with
just a range scan query, which should not take much time. Can anybody guide me
on what is wrong here?
JavaPairRDD<VendorRecord, Iterable<VendorRecord>>
I have a version that works well for Netflix data but now I am validating
on internal datasets... this code will work on matrix factors and sparse
matrices that have rows = 100 * columns... if columns are much smaller than
rows then the col-based flow works well... basically we need both flows...
I did
cartesian is an expensive operation. If you have 'M' records in locations, then
locations.cartesian(locations) will generate an MxM result. If locations is a big
RDD, it is hard to do locations.cartesian(locations) efficiently.
Yong
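A tiny sketch of that quadratic growth, assuming an existing SparkContext sc:

```
val locations = sc.parallelize(1 to 10000)   // M = 10,000 records
val pairs = locations.cartesian(locations)   // M x M = 100,000,000 pairs
println(pairs.count())                       // the result grows quadratically with M
```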
Date: Tue, 7 Apr 2015 10:04:12 -0700
From:
It's hard for us to diagnose your performance problems, because we don't
have your environment and fixing one will simply reveal the next one to be
fixed. So, I suggest you use the following strategy to figure out what
takes the most time and hence what you might try to optimize. Try replacing
The joins here are totally different implementations, but it is worrisome
that you are seeing the SQL join hanging. Can you provide more information
about the hang? jstack of the driver and a worker that is processing a
task would be very useful.
On Tue, Apr 7, 2015 at 8:33 AM, Hao Ren
1) What exactly is the relationship between the thrift server and Hive?
I'm guessing Spark is just making use of the Hive metastore to access table
definitions, and maybe some other things, is that the case?
Underneath the covers, the Spark SQL thrift server is executing queries
using a
Hi,
I have a question about Array[T].distinct on a customized class T. My data is
like RDD[(String, Array[T])] in which T is a class written by me.
There are some duplicates in each Array[T] so I want to remove them. I
override the equals() method in T and use
val dataNoDuplicates =
For more details on my question
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-generate-Java-bean-class-for-avro-files-using-spark-avro-project-tp22413.html
Thanks,
Yamini
On Tue, Apr 7, 2015 at 2:23 PM, Yamini Maddirala yamini.m...@gmail.com
wrote:
Hi Michael,
Yes, I did try
We thought it would be better to simplify the interface, since the
active set is a performance optimization but the result is identical
to calling subgraph before aggregateMessages.
The active set option is still there in the package-private method
aggregateMessagesWithActiveSet. You can actually
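A rough sketch of the subgraph-then-aggregateMessages pattern (toy data; the score threshold stands in for whatever active-vertex condition you would have passed via activeSetOpt, and an existing SparkContext sc is assumed):

```
import org.apache.spark.graphx._

// A toy graph: vertex attribute = a score, edge attribute = an unused Int.
val vertices = sc.parallelize(Seq((1L, 0.5), (2L, 0.9), (3L, 0.1)))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(3L, 2L, 1)))
val graph = Graph(vertices, edges)

// Restrict first: keep only vertices that would have been in the old active set.
val restricted = graph.subgraph(vpred = (id, score) => score > 0.3)

// Then aggregate; here, count incoming edges over the restricted graph only.
val inDegrees: VertexRDD[Int] =
  restricted.aggregateMessages[Int](ctx => ctx.sendToDst(1), _ + _)
```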
Have you looked at spark-avro?
https://github.com/databricks/spark-avro
On Tue, Apr 7, 2015 at 3:57 AM, Yamini yamini.m...@gmail.com wrote:
Using spark(1.2) streaming to read avro schema based topics flowing in
kafka
and then using spark sql context to register data as temp table. Avro maven
The Spark history server does not have the ability to serve executor
logs currently. You need to use the yarn logs command for that.
On Tue, Apr 7, 2015 at 2:51 AM, donhoff_h 165612...@qq.com wrote:
Hi, Experts
I run my Spark Cluster on Yarn. I used to get executors' Logs from Spark's
History
Hi Michael,
Yes, I did try the spark-avro 0.2.0 Databricks project. I am using CDH5.3, which
is based on Spark 1.2. Hence I'm bound to use spark-avro 0.2.0 instead of
the latest.
I'm not sure how the spark-avro project can help me in this scenario.
1. I have JavaDStream of type avro generic record
Hi,
I have written a Scala object which can query the messages which I am
receiving from Kafka.
Now I have to show it on some webpage or dashboard which can auto-refresh with
new results. Any pointers on how I can do that?
Thanks,
Mukund
Is there a way to generate a Java bean for a given Avro schema file in Spark
1.2 using the spark-avro project 0.2.0 for the following use case?
1. Topics from Kafka are read and stored in the form of Avro generic records:
JavaDStream<GenericRecord>
2. Using the spark-avro project, able to get the schema in the
Solved! Thanks for your help. I had converted null values to the Double value
(0.0).
On 06/04/2015 19:25, Joseph Bradley jos...@databricks.com wrote:
I'd make sure you're selecting the correct columns. If not that, then
your input data might be corrupt.
CCing user to keep it on the user list.
Maybe you have some sbt-built 1.3 version in your ~/.ivy2/ directory that's
masking the maven one? That's the only explanation I can come up with...
On Tue, Apr 7, 2015 at 12:22 PM, Jacek Lewandowski
jacek.lewandow...@datastax.com wrote:
So weird, as I said - I created a new empty project
That should totally work. The other option would be to run a persistent
metastore that multiple contexts can talk to and periodically run a job
that creates missing tables. The trade-off here would be more complexity,
but less downtime due to the server restarting.
On Tue, Apr 7, 2015 at 12:34
Hi All,
I am a bit confused about spark.storage.memoryFraction. This is used to set the
area for RDD usage, but does this mean only cached and persisted RDDs?
So if my program has no cached RDD at all (meaning that I have no .cache() or
.persist() call on any RDD), then I can set this
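For reference, a hedged sketch of lowering that fraction when nothing is cached (0.2 is just an example value; 0.6 is the default in Spark 1.x):

```
import org.apache.spark.{SparkConf, SparkContext}

// With no cached or persisted RDDs, the storage region mostly sits idle,
// so it can be shrunk to leave more heap for task working memory.
val conf = new SparkConf()
  .setAppName("no-cache-job")
  .set("spark.storage.memoryFraction", "0.2")  // default is 0.6 in Spark 1.x

val sc = new SparkContext(conf)
```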
So weird, as I said - I created a new empty project where Spark core was
the only dependency...
JACEK LEWANDOWSKI
DSE Software Engineer | +48.609.810.774 | jacek.lewandow...@datastax.com
Hi Michael,
Thanks so much for the reply - that really cleared a lot of things up for
me!
Let me just check that I've interpreted one of your suggestions for (4)
correctly... Would it make sense for me to write a small wrapper app that
pulls in hive-thriftserver as a dependency, iterates my
in the current Programming Guide:
https://spark.apache.org/docs/1.3.0/programming-guide.html#actions
under Actions, the Python link goes to:
https://spark.apache.org/docs/1.3.0/api/python/pyspark.rdd.RDD-class.html
which is 404
which I think should be:
For the last link, you might have meant:
https://spark.apache.org/docs/1.3.0/api/python/pyspark.html#pyspark.RDD
Cheers
On Tue, Apr 7, 2015 at 1:32 PM, jonathangreenleaf
jonathangreenl...@gmail.com wrote:
in the current Programming Guide:
Any help, please?
Help me get the right configuration.
李铖 lidali...@gmail.com wrote on Tuesday, April 7, 2015:
In my dev-test env I have 3 virtual machines; every machine has 12G
memory and 8 CPU cores.
Here are spark-defaults.conf and spark-env.sh. Maybe some config is not
right.
I run this command: spark-submit
I was unable to get this feature to work in 1.3.0. I tried building off master
and it still wasn't working for me. So I dug into the code, and I'm not sure
how the parsePartition() was ever working. The while loop which walks up the
parent directories in the path always terminates after a
It is hard to guess why OOM happens without knowing your application's logic
and the data size.
Without knowing that, I can only guess based on some common experiences:
1) Increase spark.default.parallelism 2) Increase your executor-memory; maybe
6g is just not enough 3) Your environment is kind
李铖:
w.r.t. #5, you can use --executor-cores when invoking spark-submit
Cheers
On Tue, Apr 7, 2015 at 2:35 PM, java8964 java8...@hotmail.com wrote:
It is hard to guess why OOM happens without knowing your application's
logic and the data size.
Without knowing that, I can only guess based on
Hi Sean,
I didn't override hashCode. But the problem is that Array[T].toSet works
but Array[T].distinct doesn't. If it is because I didn't override
hashCode, then toSet shouldn't work either, right? I also tried using this
Array[T].distinct outside the RDD, and it works alright there as well,
I've been using Joda Time in all my Spark jobs (by using the nscala-time
package) and have not run into any issues until I started trying to use
Spark SQL. When I try to convert a case class that has a
com.github.nscala_time.time.Imports.DateTime object in it, an exception is
thrown with a
Hi There,
We've just started to trial Spark at Bitly.
We are running Spark 1.2.1 on Cloudera (CDH-5.3.0) with Hadoop 2.5.0 and are
running into issues even just trying to run the Python examples. It's just
being run in standalone mode, I believe.
$ ./bin/spark-submit --driver-memory 2g
This could be empirically verified in spark-perf:
https://github.com/databricks/spark-perf. Theoretically, it would be
2x for k-means and logistic regression, because computation is doubled
but communication cost remains the same. -Xiangrui
On Tue, Apr 7, 2015 at 7:15 AM, Vasyl Harasymiv
The following might be helpful.
http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/What-dependencies-to-submit-Spark-jobs-programmatically-not-via/td-p/24721
http://blog.sequenceiq.com/blog/2014/08/22/spark-submit-in-java/
On 7 April 2015 at 16:32, michal.klo...@gmail.com
Hi,
Is it possible to specify a Spark property like spark.local.dir from the
command line when running an application using spark-submit?
Thanks,
arun
Hello folks,
Newbie here! Just had a quick question - is there a job submission API such
as the one with hadoop
https://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/Job.html#submit()
to submit Spark jobs to a Yarn cluster? I see in example that
bin/spark-submit is what's out
A SparkContext can submit jobs remotely.
The spark-submit options in general can be populated into a SparkConf and
passed in when you create a SparkContext.
We personally have not had too much success with yarn-client remote submission,
but standalone cluster mode was easy to get going.
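A minimal sketch of that approach against a standalone cluster (the master URL, resource values, and jar path are placeholders):

```
import org.apache.spark.{SparkConf, SparkContext}

// The options you would normally pass to spark-submit, set programmatically.
val conf = new SparkConf()
  .setAppName("remote-submission-example")
  .setMaster("spark://master-host:7077")       // placeholder standalone master URL
  .set("spark.executor.memory", "4g")
  .set("spark.cores.max", "8")
  .setJars(Seq("/path/to/application.jar"))    // placeholder path to your assembled jar

val sc = new SparkContext(conf)
sc.parallelize(1 to 100).count()               // this job now runs on the remote cluster
sc.stop()
```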
M
Thank you Xiangrui,
Indeed; however, if the computation involves taking a matrix, even locally,
like random forest, then if the data increases 2x, even local computation time
should increase 2x. But I will test it with Spark Perf and let you
know!
On Tue, Apr 7, 2015 at 4:50 PM, Xiangrui Meng
Hello,
If you are looking for the command to submit the following command works:
spark-submit --class SampleTest --master yarn-cluster --num-executors
4 --executor-cores
2 /home/priya/Spark/Func1/target/scala-2.10/simple-project_2.10-1.0.jar
On Tue, Apr 7, 2015 at 6:36 PM, Veena Basavaraj
Thanks Michael. Will submit a ticket.
Justin
On Mon, Apr 6, 2015 at 1:53 PM, Michael Armbrust mich...@databricks.com
wrote:
I'll add that I don't think there is a convenient way to do this in the
Column API ATM, but would welcome a JIRA for adding it :)
On Mon, Apr 6, 2015 at 1:45 PM,
Hi,
We are trying to run a Spark application using spark-submit on Windows 8.1.
The application runs successfully to completion on MacOS 10.10 and on
Ubuntu Linux. On Windows, we get the following error messages (see below).
It appears that Spark is trying to delete some temporary directory that
I just figured this out from the documentation:
--conf spark.local.dir=C:\Temp
On Tue, Apr 7, 2015 at 5:00 PM, Arun Lists lists.a...@gmail.com wrote:
Hi,
Is it possible to specify a Spark property like spark.local.dir from the
command line when running an application using spark-submit?
Hello,
Just want to check if anyone has tried Drools with Spark? Please let me
know. Are there any alternate rule engines that work well with Spark?
Thanks
Sathish
Hello,
I am experimenting with DataFrame. I tried to construct two DataFrames with:
1. case class A(a: Int, b: String)
scala> adf.printSchema()
root
|-- a: integer (nullable = false)
|-- b: string (nullable = true)
2. case class B(a: String, c: Int)
scala> bdf.printSchema()
root
|-- a: string
I suppose it depends a lot on the implementations. In general,
distinct and toSet work when hashCode and equals are defined
correctly. When that isn't the case, the result isn't defined; it
might happen to work in some cases. This could well explain why you
see different results. Why not implement
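A small sketch of an element class whose distinct and toSet behave consistently (a hypothetical Record type, keyed by a single id field):

```
// Equality is defined by id alone, and hashCode agrees with equals.
class Record(val id: Long) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: Record => this.id == that.id
    case _            => false
  }
  override def hashCode: Int = id.hashCode
}

val arr = Array(new Record(1), new Record(1), new Record(2))
println(arr.distinct.length)  // 2 -- duplicates removed because hashCode and equals agree
println(arr.toSet.size)       // 2
```

A case class would give you consistent equals and hashCode for free.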
I have a standalone and local Spark streaming process where we are reading
inputs using FlumeUtils. Our longest window size is 6 hours. After about a
day and a half of running without any issues, we start seeing Timeout
errors while cleaning up input blocks. This seems to cause reading from
Flume
+1. I would love to have the code for this as well.
Pramod
On Fri, Apr 3, 2015 at 12:47 PM, Tom thubregt...@gmail.com wrote:
Hi all,
As we all know, Spark has set the record for sorting data, as published on:
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html.
Here at our
I fixed this a while ago in master. It should go out with the next
release and next push of the site.
On Tue, Apr 7, 2015 at 4:32 PM, jonathangreenleaf
jonathangreenl...@gmail.com wrote:
in the current Programming Guide:
https://spark.apache.org/docs/1.3.0/programming-guide.html#actions
under
Hi -
I want to create an instance of HiveThriftServer2 in my Scala application, so
I imported the following line:
import org.apache.spark.sql.hive.thriftserver._
However, when I compile the code, I get the following error:
object thriftserver is not a member of package
Awesome. thank you!
On Apr 7, 2015 8:55 PM, Sean Owen so...@cloudera.com wrote:
I fixed this a while ago in master. It should go out with the next
release and next push of the site.
On Tue, Apr 7, 2015 at 4:32 PM, jonathangreenleaf
jonathangreenl...@gmail.com wrote:
in the current
Hello Everyone,
I am trying to implement this example (Spark Streaming with Twitter).
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala
I am able to do:
hashTags.print() to get a live stream of filtered hashtags,
Hey y'all,
While I haven't been able to get Spark + Kinesis integration working, I
pivoted to plan B: I now push data to S3 where I set up a DStream to
monitor an S3 bucket with textFileStream, and that works great.
I <3 Spark!
Best,
Vadim
On Mon, Apr 6, 2015 at 12:23 PM, Vadim Bichutskiy
Hello,
I have a parquet file of around 55M rows (~1G on disk). Performing a simple
grouping operation is pretty efficient (I get results within 10 seconds).
However, after calling DataFrame.cache, I observe a significant performance
degradation; the same operation now takes 3+ minutes.
My hunch is
Hi Justin,
Does the schema of your data have any decimal, array, map, or struct type?
Thanks,
Yin
On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip yipjus...@prediction.io wrote:
Hello,
I have a parquet file of around 55M rows (~ 1G on disk). Performing simple
grouping operation is pretty
The schema has a StructType.
Justin
On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai yh...@databricks.com wrote:
Hi Justin,
Does the schema of your data have any decimal, array, map, or struct type?
Thanks,
Yin
On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip yipjus...@prediction.io
wrote:
Hello,
Hi guys,
Currently I am running a Spark program on Amazon EC2. Each worker has around
(less than but near to) 2 GB memory.
By default, I can see each worker is allocated 976 MB memory, as the table
below from the Spark Web UI shows. I know this value comes from (total memory
minus 1 GB). But I want more
I understand that RDDs are not computed until an action is called. Is it a
correct conclusion that it doesn't matter whether .cache is used anywhere in
the program if I only have one action that is called only once?
Related to this question, consider this situation:
val d1 = data.map { case (x, y, z) => (x, y) }
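For illustration, a hedged sketch of when .cache does and does not matter (data here is a stand-in for the RDD in the question, and the second branch is hypothetical):

```
val data = sc.parallelize(Seq((1, "a", 1.0), (2, "b", 2.0)))
val d1 = data.map { case (x, y, _) => (x, y) }

// One action, called once: caching d1 changes nothing -- the lineage runs exactly once.
d1.count()

// But if d1 fed more than one action, it would be recomputed each time unless cached:
// d1.cache()
// d1.count()
// d1.collect()
```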
Hello,
I have two HDFS directories, each containing multiple Avro files. I want to
specify these two directories as input. In the Hadoop world, one can specify a
list of comma-separated directories. In Spark that does not work.
Logs
15/04/07 21:10:11 INFO storage.BlockManagerMaster: Updated info
Spark Version 1.3
Command:
./bin/spark-submit -v --master yarn-cluster --driver-class-path
Thanks for the explanation Yin.
Justin
On Tue, Apr 7, 2015 at 7:36 PM, Yin Huai yh...@databricks.com wrote:
I think the slowness is caused by the way that we serialize/deserialize
the value of a complex type. I have opened
https://issues.apache.org/jira/browse/SPARK-6759 to track the