Hi Binh,
It stores the state as well as the unprocessed data. The state is a subset
of the records that you have aggregated so far.
This provides a good reference for checkpointing.
http://spark.apache.org/docs/1.2.1/streaming-programming-guide.html#checkpointing
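For illustration, a minimal sketch of a checkpointed stateful stream; the
StreamingContext `ssc`, the DStream `events`, and the HDFS path below are
assumptions, not from this thread:

// assumes ssc: StreamingContext and events: DStream[(String, Int)]
ssc.checkpoint("hdfs:///tmp/spark-checkpoints")  // placeholder path

val updateCounts = (values: Seq[Int], state: Option[Int]) =>
  Some(values.sum + state.getOrElse(0))

// the running totals (the "state") are what gets persisted to the directory
val totals = events.updateStateByKey(updateCounts)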
On Wed, Mar 18, 2015 at 12:52 PM, Binh
You can always throw more machines at this and see if the performance
improves; you haven't mentioned anything regarding your # of cores, etc.
Thanks
Best Regards
On Wed, Mar 18, 2015 at 11:42 AM, nvrs nvior...@gmail.com wrote:
Hi all,
We are having a few issues with the performance
Hi Akhil,
Yes, that's what we are planning on doing at the end of the day. At the
moment I am doing performance testing before the job hits production,
testing on 4 cores to get baseline figures, and I deduced that in order to
grow to 10-15 million keys we'll need a batch interval of ~20 secs.
Trying to build a recommendation system using Spark MLlib's ALS.
Currently, we're trying to pre-build recommendations for all users on a daily
basis. We're using simple implicit feedback and ALS.
The problem is, we have 20M users and 30M products, and to call the main
predict() method, we need to
Hi everybody,
When trying to upgrade from Spark 1.1.1 to Spark 1.2.x (tried both 1.2.0
and 1.2.1) I encounter a weird error that never occurred before, about which
I'd kindly ask for any possible help.
In particular, all my Spark SQL queries fail with the following exception:
You can simply turn it on using:
./sbin/start-history-server.sh
Read more here http://spark.apache.org/docs/1.3.0/monitoring.html.
Thanks
Best Regards
On Wed, Mar 18, 2015 at 4:00 PM, patcharee patcharee.thong...@uni.no
wrote:
Hi,
I am using spark 1.3. I would like to use Spark Job
I don't think this is the problem, but I think you'd also want to set
-Dhadoop.version= to match your deployment version, if you're building
for a particular version, just to be safest.
I don't recall seeing that particular error before. It indicates to me
that the SparkContext is null. Is this
Probably 1.3.0 - it has some improvements in the included Kafka receiver
for streaming.
https://spark.apache.org/releases/spark-release-1-3-0.html
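For example, a rough sketch of the direct Kafka API added in 1.3; the broker
address and topic name are placeholders, and an existing StreamingContext
`ssc` is assumed:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic"))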
Regards,
Jeff
2015-03-18 10:38 GMT+01:00 James King jakwebin...@gmail.com:
Hi All,
Which build of Spark is best when using Kafka?
Regards
jk
Thanks, Shao
On Wed, Mar 18, 2015 at 3:34 PM, Shao, Saisai saisai.s...@intel.com wrote:
Yeah, as I said, your job processing time is much larger than the sliding
window, and streaming jobs are executed one by one in sequence, so the next
job will wait until the first job is finished, so the
I turned it on. But it failed to start. In the log,
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Spark Command: /usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin/java -cp
Did you try ssh tunneling instead of SOCKS?
Thanks
Best Regards
On Wed, Mar 18, 2015 at 5:45 AM, Kelly, Jonathan jonat...@amazon.com
wrote:
I'm trying to figure out how I might be able to use Spark with a SOCKS
proxy. That is, my dream is to be able to write code in my IDE then run it
Thanks Jeff, I'm planning to use it in standalone mode, OK, will use the
hadoop 2.4 package. Ciao!
On Wed, Mar 18, 2015 at 10:56 AM, Jeffrey Jedele jeffrey.jed...@gmail.com
wrote:
What you call sub-category are packages pre-built to run on certain
Hadoop environments. It really depends on where
Hi:
I need to count some game player events in the game.
Such as: how many players stay in game scene 1 (Save the
Princess from a Dragon), how much money they have paid in the last 5
minutes, how many players pay money to go through this scene,
and much more.
Would you mind providing the query? If it's confidential, could you
please help construct a query that reproduces this issue?
Cheng
On 3/18/15 6:03 PM, Roberto Coluccio wrote:
Hi everybody,
When trying to upgrade from Spark 1.1.1 to Spark 1.2.x (tried both
1.2.0 and 1.2.1) I encounter a
Hi Mas,
I never actually worked with GraphX, but one idea:
As far as I know, you can directly access the vertex and edge RDDs of your
Graph object. Why not simply run a .filter() on the edge RDD to get all
edges that originate from or end at your vertex?
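For example, a sketch assuming a Graph[VD, ED] named graph and a
hypothetical vertex id v:

val v = 42L  // hypothetical vertex id of interest
// edges that originate from or end at v
val incident = graph.edges.filter(e => e.srcId == v || e.dstId == v)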
Regards,
Jeff
2015-03-18 10:52 GMT+01:00
I think you can disable it with spark.shuffle.spill=false
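For example (a sketch; the property can equally go into spark-defaults.conf):

import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.shuffle.spill", "false")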
Thanks
Best Regards
On Wed, Mar 18, 2015 at 3:39 PM, Darren Hoo darren@gmail.com wrote:
Thanks, Shao
On Wed, Mar 18, 2015 at 3:34 PM, Shao, Saisai saisai.s...@intel.com
wrote:
Yeah, as I said your job processing time is much
What you call sub-category are packages pre-built to run on certain
Hadoop environments. It really depends on where you want to run Spark. As
far as I know, this is mainly about the included HDFS binding - so if you
just want to play around with Spark, any of the packages should be fine. I
I've already done that:
From the Spark UI's Environment tab, Spark properties has:
spark.shuffle.spill false
On Wed, Mar 18, 2015 at 6:34 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
I think you can disable it with spark.shuffle.spill=false
Thanks
Best Regards
On Wed, Mar 18, 2015 at 3:39 PM,
You don't have the YARN package in the classpath. You need to build your
Spark with YARN support. You can read these docs.
http://spark.apache.org/docs/1.3.0/running-on-yarn.html
Thanks
Best Regards
On Wed, Mar 18, 2015 at 4:07 PM, patcharee patcharee.thong...@uni.no
wrote:
I turned it on. But it
Hi,
I am using Spark 1.3. I would like to use the Spark Job History Server. I
added the following lines into conf/spark-defaults.conf
spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
spark.history.provider
org.apache.spark.deploy.yarn.history.YarnHistoryProvider
Hi,
If you do a cartesian join to predict users' preferences over all the
products, I think that 8 nodes with 64GB RAM would not be enough for the
data.
Recently, I used ALS for a similar situation, but with just 10M users and
0.1M products; the minimum requirement was 9 nodes with 10GB RAM.
Moreover,
Thanks much for your reply.
By saying on the fly, you mean caching the trained model, and querying it
for each user, joined with 30M products, when needed?
Our question is more about the general approach: what if we have 7M DAU?
How do companies deal with that using Spark?
On Wed, Mar 18, 2015
Hi all,
We are using spark-assembly-1.2.0-hadoop2.0.0-mr1-cdh4.2.0.jar in our
application. When we try to deploy the application on Jetty
(jetty-distribution-9.2.10.v20150310) we get the below exception at the
server startup.
Initially we were getting the below exception,
Caused by:
Hi,
My Spark was compiled with the yarn profile; I can run Spark on YARN without
problem.
For the spark job history server problem, I checked
spark-assembly-1.3.0-hadoop2.4.0.jar and found that the package
org.apache.spark.deploy.yarn.history is missing. I don't know why
BR,
Patcharee
hey guys,
In my understanding SparkSQL only supports JDBC connection through hive thrift
server, is this correct?
Thanks
From the log you pasted, I think this (-rw-r--r-- 1 root root 80K Mar 18
16:54 shuffle_47_519_0.data) is not shuffle spilled data, but the final
shuffle result. As I said, do you think shuffle is the bottleneck that makes
your job run slowly? Maybe you should identify the cause first.
Hi there,
I was trying the new DataFrame API with some basic operations on a parquet
dataset.
I have 7 nodes of 12 cores and 8GB RAM allocated to each worker in a
standalone cluster mode.
The code is the following:
val people = sqlContext.parquetFile("/data.parquet");
val res =
Yes, I have been using Spark SQL from the onset. Haven't found any other
Server for Spark SQL for JDBC connectivity.
On Wed, Mar 18, 2015 at 5:50 PM, sequoiadb mailing-list-r...@sequoiadb.com
wrote:
hey guys,
In my understanding SparkSQL only supports JDBC connection through hive
thrift
Not just the join, but this means you're trying to compute 600
trillion dot products. It will never finish fast. Basically: don't do
this :) You don't in general compute all recommendations for all
users, but recompute for a small subset of users that were or are
likely to be active soon. (Or
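As a rough sketch of that subset approach, assuming a trained
MatrixFactorizationModel named model and a hypothetical list of
recently-active user ids:

val activeUsers = Seq(1, 2, 3)  // hypothetical recently-active users
// top-10 recommendations per active user, instead of all 20M users
val topK = activeUsers.map(u => u -> model.recommendProducts(u, 10))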
Hi all,
Trying to build a recommendation system using Spark MLlib's ALS.
Currently, we're trying to pre-build recommendations for all users on a daily
basis. We're using simple implicit feedback and ALS.
The problem is, we have 20M users and 30M products, and to call the main
predict() method, we
On Wed, Mar 18, 2015 at 8:31 PM, Shao, Saisai saisai.s...@intel.com wrote:
From the log you pasted I think this (-rw-r--r-- 1 root root 80K Mar
18 16:54 shuffle_47_519_0.data) is not shuffle spilled data, but the
final shuffle result.
Why is the shuffle result written to disk?
As I
When the size of the graph is huge (0.2 billion vertices, 6 billion edges),
the srcAttr and dstAttr in graph.triplets don't update when using
Graph.outerJoinVertices (when the data in a vertex is changed).
The code and the log are as follows:
g = graph.outerJoinVertices()...
g.vertices.count()
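For reference, the usual shape of such a call looks roughly like this (a
sketch; updates is a hypothetical RDD[(VertexId, VD)] of new vertex values):

val g = graph.outerJoinVertices(updates) { (vid, oldAttr, newOpt) =>
  newOpt.getOrElse(oldAttr)  // keep the old value where no update exists
}
g.vertices.count()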
Those classes are not part of standard Spark. You may want to contact
Hortonworks directly if they're suggesting you use those.
On Wed, Mar 18, 2015 at 3:30 AM, patcharee patcharee.thong...@uni.no wrote:
Hi,
I am using spark 1.3. I would like to use Spark Job History Server. I added
the
I don't think that you need enough memory to put the whole joined data set
in memory. However, memory is unlikely to be the limiting factor; it's the
massive shuffle.
OK, you really do have a large recommendation problem if you're
recommending for at least 7M users per day!
My hunch is that it won't be
I think hsy541 is still confused by what is still confusing to me. Namely,
what is the value that the sentence "Each RDD in a DStream contains data
from a certain interval" is speaking of? This is from the Discretized Streams
You should probably increase executor memory by setting
spark.executor.memory.
Full list of available configurations can be found here
http://spark.apache.org/docs/latest/configuration.html
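For example, a sketch of setting it programmatically; the 8g value below is
illustrative, and it can also be passed via spark-submit:

import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.executor.memory", "8g")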
Cheng
On 3/18/15 9:15 PM, Yiannis Gkoufas wrote:
Hi there,
I was trying the new DataFrame API with
I suspect that you hit this bug
https://issues.apache.org/jira/browse/SPARK-6250, it depends on the
actual contents of your query.
Yin had opened a PR for this, although not merged yet, it should be a
valid fix https://github.com/apache/spark/pull/5078
This fix will be included in 1.3.1.
There is also a batch prediction API in PR
https://github.com/apache/spark/pull/3098
The idea here is what Sean said: don't try to reconstruct the whole matrix,
which will be dense, but pick a set of users and calculate top-k
recommendations for them using dense level 3 BLAS. We are going to merge
Yes
On 3/18/15 8:20 PM, sequoiadb wrote:
hey guys,
In my understanding SparkSQL only supports JDBC connection through hive thrift
server, is this correct?
Thanks
Thanks gen for the helpful post.
Thank you Sean, we're currently exploring this world of recommendations
with Spark, and your posts are very helpful to us.
We've noticed that you're a co-author of Advanced Analytics with Spark;
just not to get too deep into off-topic territory, will it be finished soon?
On Wed,
Hi,
I am running Column Similarity (All Pairs Similarity using DIMSUM) in Spark on
a dataset that looks like (Entity, Attribute, Value) after transforming the
same to a row-oriented dense matrix format (one line per Attribute, one column
per Entity, each cell with normalized value – between 0
You know, I actually have one of the columns called timestamp! This may
really cause the problem reported in the bug you linked, I guess.
On Wed, Mar 18, 2015 at 3:37 PM, Cheng Lian lian.cs@gmail.com wrote:
I suspect that you hit this bug
https://issues.apache.org/jira/browse/SPARK-6250,
Hi Cheng, thanks for your reply.
The query is something like:
SELECT * FROM (
SELECT m.column1, IF (d.columnA IS NOT null, d.columnA, m.column2), ...,
m.columnN FROM tableD d RIGHT OUTER JOIN tableM m ON m.column2 = d.columnA
WHERE m.column2 != "None" AND d.columnA != ""
UNION ALL
SELECT
I started to play with 1.3.0 and found that there are a lot of breaking
changes. Previously, I could do the following:
case class Foo(x: Int)
val rdd = sc.parallelize(List(Foo(1)))
import sqlContext._
rdd.registerTempTable("foo")
Now, I am not able to directly use my RDD object and
Hi there, I set the executor memory to 8g but it didn't help
On 18 March 2015 at 13:59, Cheng Lian lian.cs@gmail.com wrote:
You should probably increase executor memory by setting
spark.executor.memory.
Full list of available configurations can be found here
Hey Cheng, thank you so much for your suggestion, the problem was actually
a column/field called timestamp in one of the case classes!! Once I
changed its name everything worked out fine again. Let me say it was kinda
frustrating ...
Roberto
On Wed, Mar 18, 2015 at 4:07 PM, Roberto Coluccio
In case it interests other people, here is what I came up with, and it
seems to work fine:
case class RDDAsInputStream(private val rdd: RDD[String]) extends
java.io.InputStream {
var bytes = rdd.flatMap(_.getBytes("UTF-8")).toLocalIterator
def read(): Int = {
if(bytes.hasNext)
Thanks for the information. Will rebuild with 0.6.0 till the patch is
merged.
On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu yuzhih...@gmail.com wrote:
Ranga:
Take a look at https://github.com/apache/spark/pull/4867
Cheers
On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com fightf...@163.com
It appears that the metastore_db problem is related to
https://issues.apache.org/jira/browse/SPARK-4758. I had another shell open
that was stuck. This is probably a bug, though?
import sqlContext.implicits._
case class Foo(x: Int)
val rdd = sc.parallelize(List(Foo(1)))
rdd.toDF
sc.getPersistentRDDs(0).asInstanceOf[RDD[Array[Double]]]
/user/hive/warehouse is a hdfs location.
I’ve changed the mode for this location but I’m still having the same issue.
hduser@hadoop01-VirtualBox:/opt/spark/bin$ hdfs dfs -chmod -R 777 /user/hive
hduser@hadoop01-VirtualBox:/opt/spark/bin$ hdfs dfs -ls /user/hive/warehouse
Found 1 items
I am trying to understand mapPartitions but I am still not sure how it
works.
In the below example it creates three partitions:
val parallel = sc.parallelize(1 to 10, 3)
and when we do the below
parallel.mapPartitions( x => List(x.next).iterator).collect
it prints the value
Array[Int] = Array(1, 4,
Here is what I think:
mapPartitions is for a specialized map that is called only once for each
partition. The entire content of the respective partition is available as a
sequential stream of values via the input argument (Iterator[T]). The
combined result iterators are automatically
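A small sketch of that contract; the output values assume Spark's default
slicing of 1 to 10 into 3 partitions (1..3, 4..6, 7..10):

val parallel = sc.parallelize(1 to 10, 3)
// the function runs once per partition and receives an Iterator[Int]
val sums = parallel.mapPartitions(iter => Iterator(iter.sum)).collect()
// sums: Array(6, 15, 34), the per-partition sums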
Map partitions works as follows:
1) For each partition of your RDD, it provides an iterator over the values
within that partition
2) You then define a function that operates on that iterator
Thus if you do the following:
val parallel = sc.parallelize(1 to 10, 3)
parallel.mapPartitions( x =>
Hi Reza,
I have tried thresholds only in the range of 0 to 1. I was not aware that
the threshold can be set above 1.
Will try and update.
Thank You
- Manish
From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Wednesday, March 18, 2015 10:55 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Hi Roberto,
For now, if the timestamp is a top-level column (not a field in a
struct), you can use backticks to quote the column name, like `timestamp`.
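e.g., a sketch with a hypothetical table name:

sqlContext.sql("SELECT `timestamp`, other_col FROM myTable")  // myTable, other_col are placeholders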
Thanks,
Yin
On Wed, Mar 18, 2015 at 12:10 PM, Roberto Coluccio
roberto.coluc...@gmail.com wrote:
Hey Cheng, thank you so much for your
I was wondering what people generally do about doing database operations from
executor nodes. I’m (at least for now) avoiding doing database updates from
executor nodes to avoid proliferation of database connections on the cluster.
The general pattern I adopt is to collect queries (or tuples)
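For comparison, the other common pattern is one connection per partition
rather than per row, via foreachPartition; a hedged sketch where
createConnection and upsert are hypothetical helpers:

rdd.foreachPartition { rows =>
  val conn = createConnection()  // hypothetical: one connection per partition
  try rows.foreach(row => upsert(conn, row))  // hypothetical upsert helper
  finally conn.close()
}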
Hi All,
I am using Spark version 1.2 running locally. When I try to read a parquet
file I get the below exception; what might be the issue?
Any help will be appreciated. This is the simplest operation/action on a
parquet file.
//code snippet//
val sparkConf = new SparkConf().setAppName(
To answer your first question - yes, 1.3.0 did break backward compatibility
for the change from SchemaRDD -> DataFrame. Spark SQL was an alpha component,
so API-breaking changes could happen. It is no longer an alpha component as
of 1.3.0, so this will not be the case in the future.
Adding toDF
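For example, a sketch of the 1.3 idiom (note the now-explicit implicits
import):

import sqlContext.implicits._
case class Foo(x: Int)
val df = sc.parallelize(List(Foo(1))).toDF()
df.registerTempTable("foo")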
I have used various versions of Spark (1.0, 1.2.1) without any issues.
Though I have not significantly used Kafka with 1.3.0, preliminary testing
revealed no issues.
- khanderao
On Mar 18, 2015, at 2:38 AM, James King jakwebin...@gmail.com wrote:
Hi All,
Which build of Spark is
Sean, you are exactly right, as I learned from parsing your earlier reply
more carefully -- sorry I didn't do this the first time.
Setting hadoop.version was indeed the solution
./make-distribution.sh --tgz -Pyarn -Phadoop-2.4 -Phive -Phive-thriftserver
-Dhadoop.version=2.5.0-cdh5.3.2
Thanks
I'm coming from a Hadoop background but I'm totally new to Apache Spark. I'd
like to do topic modeling using the LDA algorithm on some txt files. The
example on the Spark website assumes that the input to LDA is a file
containing the word counts. I wonder if someone could help me figure out the
Hi Yin,
Thanks for your feedback. I have 1700 parquet files, sized 100MB each. The
number of tasks launched is equal to the number of parquet files. Do you
have any idea on how to deal with this situation?
Thanks a lot
On 18 Mar 2015 17:35, Yin Huai yh...@databricks.com wrote:
Seems there are
Hi Ted,
The spark executors and hbase regions/masters are all collocated. This is a 2
node test environment.
Best,
Eric
Eric Walk, Sr. Technical Consultant
p: 617.855.9255 | NASDAQ: PRFT | Perficient.comhttp://www.perficient.com/
From: Ted Yu yuzhih...@gmail.com
Sent: Mar 18, 2015
List(x.next).iterator is giving you the first element from each partition,
which would be 1, 4 and 7 respectively.
On 3/18/15, 10:19 AM, ashish.usoni ashish.us...@gmail.com wrote:
I am trying to understand about mapPartitions but i am still not sure how
it
works
in the below example it create
Hi, Saisai
Here is the duration of one of the jobs, 22 seconds in total; it is longer
than the sliding window.
Stage Id | Description | Submitted | Duration | Tasks: Succeeded/Total | Input | Output | Shuffle Read | Shuffle Write
342 | foreach at SimpleApp.scala:58 | 2015/03/18
Hi Arush,
Thank you for answering!
When you say checkpoints hold metadata and data, what is the data? Is it
the data that is pulled from the input source, or is it the state?
If it is the state, then is it the same number of records that I have
aggregated since the beginning, or only a subset of them? How can I limit
Yeah, as I said, your job processing time is much larger than the sliding
window, and streaming jobs are executed one by one in sequence, so the next
job will wait until the first job is finished, so the total latency will be
accumulated.
I think you need to identify the bottleneck of your job at
It seems the elasticsearch-hadoop project was built with an old version of
Spark, and then you upgraded the Spark version in the execution env. As far
as I know, the StructField definition changed in Spark 1.2; can you confirm
the version problem first?
From: Todd Nist [mailto:tsind...@gmail.com]
Sent:
Hi,
I am generating a PCA using Spark, but I don't know how to save it to disk
or visualize it.
Can someone give me some pointers please?
Thanks
-Roni
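For what it's worth, a minimal sketch of computing principal components and
writing them out, assuming an existing rows: RDD[Vector]; the k value and
output path are placeholders:

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)
val pc = mat.computePrincipalComponents(2)  // returns a local Matrix; k = 2
// a local Matrix has no save method of its own; write the values out manually
sc.parallelize(pc.toArray.toSeq, 1).saveAsTextFile("hdfs:///tmp/pca")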
Sun,
Just want to confirm that it was in fact an authentication issue.
The issue is resolved now and I can see my tables through Simba ODBC driver.
Thanks a lot.
Shahdad
From: fightf...@163.com [mailto:fightf...@163.com]
Sent: March-17-15 6:33 PM
To: Shahdad Moradi; user
Subject: Re:
Thanks for the quick response.
The spark server is spark-1.2.1-bin-hadoop2.4 from the Spark download. Here
is the startup:
radtech$ ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to
Sorry if this is a total noob question but is there a reason why I'm only
seeing folks' responses to my posts in emails but not in the browser view
under apache-spark-user-list.1001560.n3.nabble.com? Is this a matter of
setting your preferences such that your responses only go to email and never
Please see the inline comments.
Thanks
Jerry
From: Darren Hoo [mailto:darren@gmail.com]
Sent: Wednesday, March 18, 2015 9:30 PM
To: Shao, Saisai
Cc: user@spark.apache.org; Akhil Das
Subject: Re: [spark-streaming] can shuffle write to disk be disabled?
On Wed, Mar 18, 2015 at 8:31 PM,
I am attempting to access ElasticSearch and expose its data through
SparkSQL using the elasticsearch-hadoop project. I am encountering the
following exception when trying to create a temporary table from a resource
in ElasticSearch:
15/03/18 07:54:46 INFO DAGScheduler: Job 2 finished: runJob
Unlike a map(), wherein your task is acting on a row at a time, with
mapPartitions() the task is passed the entire content of the partition in
an iterator. You can then return another iterator as the output. I
don't do Scala, but from what I understand of your code snippet... The
iterator
hi all, I've been DDGing, Stack Overflowing, Twittering, RTFMing, and
scanning through this archive with only moderate success. in other words --
my way of saying sorry if this is answered somewhere obvious and I missed it
:-)
i've been tasked with figuring out how to connect Notebook, Spark,
Hi David,
W00t indeed, and great questions. On the notebook front, there are two
options depending on what you are looking for. You can either go with
IPython 3 with Spark-kernel as a backend, or you can use spark-notebook.
Both have interesting tradeoffs.
If you are looking for a single notebook
Hi,
I am trying the jdbc data source in Spark SQL 1.3.0 and found some issues.
First, the syntax where str_col='value' will give an error for both
postgresql and mysql:
psql create table foo(id int primary key,name text,age int);
bash SPARK_CLASSPATH=postgresql-9.4-1201-jdbc41.jar
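For context, the basic 1.3 jdbc-source call being exercised looks roughly
like this (a sketch; the connection URL is a placeholder):

val jdbcDF = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:postgresql://localhost/test",
  "dbtable" -> "foo"))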
It depends. If the data size on which the calculation is to be done is very
large, then caching it with MEMORY_AND_DISK is useful. Even in this
case, MEMORY_AND_DISK
is useful if the computation on the RDD is expensive. If the computation is
very small, then even for large data sets MEMORY_ONLY can be
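i.e., a one-line sketch of choosing the level explicitly:

import org.apache.spark.storage.StorageLevel
rdd.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk instead of recomputing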
Thanks for clarifying Todd. This may then be an issue specific to the HDP
version we're using. Will continue to debug and post back if there's any
resolution.
On Thu, Mar 19, 2015 at 3:40 AM, Todd Nist tsind...@gmail.com wrote:
Yes I believe you are correct.
For the build you may need to
Are hbase config / keytab files deployed on executor machines ?
Consider adding -Dsun.security.krb5.debug=true for debug purpose.
Cheers
On Wed, Mar 18, 2015 at 11:39 AM, Eric Walk eric.w...@perficient.com
wrote:
Having an issue connecting to HBase from a Spark container in a Secure
What's the best way to go from:
RDD[(A, B)] to (RDD[A], RDD[B])
If I do:
def separate[A, B](k: RDD[(A, B)]) = (k.map(_._1), k.map(_._2))
Which is the obvious solution, but this runs two maps in the cluster. Can I do
some kind of a fold instead:
def separate[A, B](l: List[(A, B)]) =
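One hedged sketch of the caching variant, which at least avoids recomputing
the parent lineage twice:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def separate[A: ClassTag, B: ClassTag](k: RDD[(A, B)]): (RDD[A], RDD[B]) = {
  k.cache()  // both maps reuse the cached parent instead of recomputing it
  (k.map(_._1), k.map(_._2))
}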
Hi Manish,
Did you try calling columnSimilarities(threshold) with different threshold
values? Try threshold values of 0.1, 0.5, 1, 20, and higher.
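e.g., assuming the data is already in a RowMatrix named mat (a sketch):

val sims = mat.columnSimilarities(0.5)  // higher threshold: faster, more approximate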
Best,
Reza
On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 mgupt...@sapient.com
wrote:
Hi,
I am running Column Similarity (All Pairs
Hi all,
I am trying to run my job which needs spark-sql_2.11-1.3.0.jar.
The cluster that I am running on is still on spark-1.2.0.
I tried the following :
spark-submit --class class-name --num-executors 100 --master yarn
application_jar --jars hdfs:///path/spark-sql_2.11-1.3.0.jar
Hi All,
I am pushing data from Kinesis stream to S3 using Spark Streaming and
noticed that during testing (i.e. master=local[2]) the batches (1 second
intervals) were falling behind the incoming data stream at about 5-10
events / second. It seems that the rdd.saveAsTextFile("s3n://...") is taking
I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same
issue. Here are the logs:
15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder'
15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId'
15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts to
Yes I believe you are correct.
For the build you may need to specify the specific HDP version of hadoop to
use with the -Dhadoop.version=. I went with the default 2.6.0, but
Horton may have a vendor specific version that needs to go here. I know I
saw a similar post today where the solution
Hey Dimitriy, thanks for sharing your solution.
I have some more updates.
The problem comes out when shuffle is involved. Using coalesce with
shuffle=true behaves like reduceByKey + a smaller number of partitions,
except that the whole save stage hangs. I am not sure yet if it only
happens with UnionRDD or
Still a Spark noob grappling with the concepts...
I'm trying to grok the idea of integrating something like the Morphlines
pipelining library with Spark (or SparkStreaming). The Kite/Morphlines doc
states that runtime executes all commands of a given morphline in the same
thread... there are no
Did you recompile it with Tachyon 0.6.0?
Also, Tachyon 0.6.1 has been released this morning:
http://tachyon-project.org/ ; https://github.com/amplab/tachyon/releases
Best regards,
Haoyuan
On Wed, Mar 18, 2015 at 11:48 AM, Ranga sra...@gmail.com wrote:
I just tested with Spark-1.3.0 +
Since you're using YARN, you should be able to download a Spark 1.3.0
tarball from Spark's website and use spark-submit from that
installation to launch your app against the YARN cluster.
So effectively you would have 1.2.0 and 1.3.0 side-by-side in your cluster.
On Wed, Mar 18, 2015 at 11:09
Thanks Ted. Will do.
On Wed, Mar 18, 2015 at 2:27 PM, Ted Yu yuzhih...@gmail.com wrote:
Ranga:
Please apply the patch from:
https://github.com/apache/spark/pull/4867
And rebuild Spark - the build would use Tachyon-0.6.1
Cheers
On Wed, Mar 18, 2015 at 2:23 PM, Ranga sra...@gmail.com
Does map(...) preserve ordering of original RDD?
What persistence level is better if the RDD to be cached is heavily
recalculated?
Am I right that it is MEMORY_AND_DISK?
Hi Haoyuan
No. I assumed that Spark-1.3.0 was already built with Tachyon-0.6.0. If
not, I can rebuild and try. Could you let me know how to rebuild with 0.6.0?
Thanks for your help.
- Ranga
On Wed, Mar 18, 2015 at 12:59 PM, Haoyuan Li haoyuan...@gmail.com wrote:
Did you recompile it with
Ranga:
Please apply the patch from:
https://github.com/apache/spark/pull/4867
And rebuild Spark - the build would use Tachyon-0.6.1
Cheers
On Wed, Mar 18, 2015 at 2:23 PM, Ranga sra...@gmail.com wrote:
Hi Haoyuan
No. I assumed that Spark-1.3.0 was already built with Tachyon-0.6.0. If
not,
Hi,
Yes, ordering is preserved with map. Shuffles break ordering.
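A quick sketch of the distinction (output assumes default partitioning):

sc.parallelize(1 to 5, 2).map(_ * 2).collect()      // Array(2, 4, 6, 8, 10): order kept
sc.parallelize(1 to 5, 2).repartition(3).collect()  // shuffled: order no longer guaranteed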
Burak
On Wed, Mar 18, 2015 at 2:02 PM, sergunok ser...@gmail.com wrote:
Does map(...) preserve ordering of original RDD?
I wonder whether the newly-released LDA (Latent Dirichlet Allocation)
algorithm only supports unigrams, or whether it can also support
bi/tri-grams. If it can, can someone help me with how to use them?