Hi,
Is there a way to get the elements of each cluster after running kmeans
clustering? I am using the Java version.
thanks
Hi Siddharth,
The answer depends on what exactly you are trying to solve, but the
connectivity between Cassandra and Spark is good.
Thanks,
Vishnu
On Wed, Feb 11, 2015 at 7:47 PM, Siddharth Ubale
siddharth.ub...@syncoms.com wrote:
Hi,
I am new
Regarding backticks: Right. You need backticks to quote the column name
timestamp because timestamp is a reserved keyword in our parser.
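For example, a hedged sketch (the events table and value column are made up):
// backticks quote the reserved word; a bare timestamp would fail to parse
val results = sqlContext.sql("SELECT `timestamp`, value FROM events")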
On Tue, Feb 10, 2015 at 3:02 PM, Mohnish Kodnani mohnish.kodn...@gmail.com
wrote:
Actually I tried it in the spark shell, got the same error, and then for some reason
I
I am confused as to whether Avro support was merged into Spark 1.2 or whether it
is still an independent library.
I see some people writing sqlContext.avroFile similarly to jsonFile but this
does not work for me, nor do I see this in the Scala docs.
I didn't mean that. When you try the above approach, only one client will have
access to the cached data.
But when you expose your data through a Thrift server the case is quite
different.
In the case of the Thrift server, all the requests go to the Thrift server and
Spark will be able to take the
Hi Jianshi,
For YARN, there may be an issue with how a recent patch changes the
accessibility of the shuffle files by the external shuffle service:
https://issues.apache.org/jira/browse/SPARK-5655. It is likely that you
will hit this with 1.2.1, actually. For this reason I would have to
Hi,
Using Spark 1.2, I ran into issues setting SPARK_LOCAL_DIRS to a different
path than the local directory.
On our cluster we have a folder for temporary files (in a central file
system), which is called /scratch.
When setting SPARK_LOCAL_DIRS=/scratch/node name
I get:
An error occurred while
Check this link.
https://github.com/databricks/spark-avro
Home page for Spark-avro project.
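A sketch of how the library is typically used from the shell (the path is illustrative; avroFile is added to SQLContext by the spark-avro implicits, not by Spark 1.2 itself):
import com.databricks.spark.avro._

val episodes = sqlContext.avroFile("hdfs://path/to/episodes.avro")
episodes.registerTempTable("episodes")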
Thanks,
Vishnu
On Wed, Feb 11, 2015 at 10:19 PM, Todd bit1...@163.com wrote:
Databricks provides sample code on its website... but I can't find it right
now.
At 2015-02-12 00:43:07, captainfranz
A central location, such as NFS?
If they are temporary for the purpose of further job processing you'll want
to keep them local to the node in the cluster, i.e., in /tmp. If they are
centralized you won't be able to take advantage of data locality and the
central file store will become a
Hi Siddharth,
With v 4.3 of Phoenix, you can use the PhoenixInputFormat and
OutputFormat classes to pull/push to Phoenix from Spark.
HTH
Thanks
Ravi
On Wed, Feb 11, 2015 at 6:59 AM, Ted Yu yuzhih...@gmail.com wrote:
Connectivity to HBase is also available. You can take a look at:
Databricks provides sample code on its website... but I can't find it right now.
At 2015-02-12 00:43:07, captainfranz captainfr...@gmail.com wrote:
I am confused as to whether Avro support was merged into Spark 1.2 or whether it
is still an independent library.
I see some people writing
Thanks for the info. The file system in use is a Lustre file system.
Best,
Tassilo
On Wed, Feb 11, 2015 at 12:15 PM, Charles Feduke charles.fed...@gmail.com
wrote:
A central location, such as NFS?
If they are temporary for the purpose of further job processing you'll
want to keep them
Same here.. I am a newbie to all this as well.
But this is just what I found, and I lack the expertise to figure out why
things don't work in json4s 3.2.11.
Maybe someone in the group with more expertise can take a crack at it.
But this is what unblocked me from moving forward.
On Wed, Feb 11,
that explains a lot...
Is there a list of reserved keywords?
On Wed, Feb 11, 2015 at 7:56 AM, Yin Huai yh...@databricks.com wrote:
Regarding backticks: Right. You need backticks to quote the column name
timestamp because timestamp is a reserved keyword in our parser.
On Tue, Feb 10, 2015
Take a look at this:
http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre
Particularly: http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf
(linked from that article)
to get a better idea of what your options are.
If it's possible to avoid writing to [any] disk, I'd recommend that
cat ../hadoop/spark-install/conf/spark-env.sh
export SCALA_HOME=/home/hadoop/scala-install
export SPARK_WORKER_MEMORY=83971m
export SPARK_MASTER_IP=spark-m
export SPARK_DAEMON_MEMORY=15744m
export SPARK_WORKER_DIR=/hadoop/spark/work
export SPARK_LOCAL_DIRS=/hadoop/spark/tmp
export
You can use model.predict(point); that will help you identify the cluster
center and map it to the point.
rdd.map(x => (x, model.predict(x)))
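A slightly fuller sketch (points, numClusters, and numIterations are assumed to be the data and parameters used to train the model):
import org.apache.spark.mllib.clustering.KMeans

val model = KMeans.train(points, numClusters, numIterations)
// pair each point with its predicted cluster id, then group to collect each cluster's members
val membersByCluster = points.map(p => (model.predict(p), p)).groupByKey()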
Thanks,
Vishnu
On Wed, Feb 11, 2015 at 11:06 PM, Harini Srinivasan har...@us.ibm.com
wrote:
Hi,
Is there a way to get the elements of each cluster after
KMeansModel only returns the cluster centroids.
To get the # of elements in each cluster, try calling kmeans.predict() on each
of the points in the data used to build the model.
See
Thank you Felix and Kelvin. I think I'll definitely be using the k-means tools in
MLlib.
It seems the best way to stream data is by storing in hbase and then using
an api in my viz to extract data? Does anyone have any thoughts on this?
Thanks!
On Tue, Feb 10, 2015 at 11:45 PM, Felix C
On Wed, Feb 11, 2015 at 10:47 AM, rok rokros...@gmail.com wrote:
I was having trouble with memory exceptions when broadcasting a large lookup
table, so I've resorted to processing it iteratively -- but how can I modify
an RDD iteratively?
I'm trying something like :
rdd =
I'm trying to use the json4s library in a spark job to push data back into
kafka. Everything was working fine when I was hard coding a string, but
now that I'm trying to render a string from a simple map it's failing. The
code works in sbt console.
working console code:
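For context, a minimal json4s sketch of rendering a simple structure (field names are made up):
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// build a JObject from pairs and render it to a compact JSON string
val payload = compact(render(("word" -> "spark") ~ ("count" -> 42)))
// payload == {"word":"spark","count":42}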
Yes, the local matrix is broadcast to each worker. Here is the code:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L407
In 1.3 we will have Block matrix multiplication too, which will allow
distributed matrix
You need to have the right HDFS account, e.g., hdfs, to create the directory and
assign permissions.
Thanks.
Zhan Zhang
On Feb 11, 2015, at 4:34 AM, guxiaobo1982
guxiaobo1...@qq.com wrote:
Hi Zhan,
My Single Node Cluster of Hadoop is installed by Ambari 1.7.0, I tried to
I was getting a similar error after I upgraded to Spark 1.2.1 from 1.1.1.
Are you by any chance using json4s 3.2.11?
I downgraded to 3.2.10 and that seemed to have worked. But I didn't spend
much more time debugging the issue than that.
On Wed, Feb 11, 2015 at 11:13 AM, Jonathan Haddad
Thanks for the reply.
I have the following Maven dependencies, which look correct to me:
Maven: org.slf4j:slf4j-log4j12:1.7.5
Maven: org.slf4j:jcl-over-slf4j:1.7.5
Maven: org.slf4j:jul-to-slf4j:1.7.5
Maven: org.slf4j:slf4j-api:1.7.5
Maven: log4j:log4j:1.2.17
At 2015-02-11 23:27:54, Ted Yu
After compiling the Spark 1.2.0 codebase in IntelliJ IDEA and running the LocalPi
example, I got the following slf4j-related issue. Does anyone know how to fix
this? Thanks
Error:scalac: bad symbolic reference. A signature in Logging.class refers to
type Logger
in package org.slf4j which is not
You are right. I've checked the overall stage metrics, and it looks like the
largest shuffle write is over 9G. The partition completed successfully,
but its spilled file can't be removed until all the others are finished.
It's very likely caused by a stupid mistake in my design. A lookup table
grows
Hi,
I am new to Spark. We have recently moved from Apache Storm to Apache Spark to
build our OLAP tool.
Earlier, we were using HBase with Phoenix.
We need to re-think what to use in the case of Spark.
Should we go ahead with HBase, Hive, or Cassandra for query processing with
Spark SQL?
Please
Spark depends on slf4j 1.7.5
Please check your classpath and make sure slf4j is included.
Cheers
On Wed, Feb 11, 2015 at 6:20 AM, Todd bit1...@163.com wrote:
After compiling the Spark 1.2.0 codebase in IntelliJ IDEA and running the
LocalPi example, I got the following slf4j-related issue. Does
Sorry folks, it is executing Spark jobs instead of Hive jobs. I misread the
logs since there were other activities going on on the cluster.
From: alee...@hotmail.com
To: ar...@sigmoidanalytics.com; tsind...@gmail.com
CC: user@spark.apache.org
Subject: RE: SparkSQL + Tableau Connector
Date: Wed,
It should relay the queries to Spark (i.e., you shouldn't see any MR jobs on
Hadoop; you should see activity in the Spark app on the headnode UI).
Check your hive-site.xml. Are you pointing to the HiveServer2 port instead
of the Spark thrift port?
Their default ports are both 10000.
From: Andrew
Hi,
Compiled the latest master of Spark yesterday (2015-02-10) for Hadoop 2.2
and failed executing jobs in yarn-cluster mode for that build. Works
successfully with spark 1.2 (and also master from 2015-01-16), so something
has changed since then that prevents the job from receiving any executors
Aha great! Thanks for the clarification!
On Feb 11, 2015 8:11 PM, Davies Liu dav...@databricks.com wrote:
On Wed, Feb 11, 2015 at 10:47 AM, rok rokros...@gmail.com wrote:
I was having trouble with memory exceptions when broadcasting a large
lookup
table, so I've resorted to processing it
Thank you Costin. I wrote to the user list; I got no replies there.
I will take this exact message and put it on the Github bug tracking system.
One quick clarification: I read the elasticsearch documentation thoroughly,
and I saw the warning about structured data vs. unstructured data, but
Bumping a one-on-one conversation to the mailing list:
On 10 Feb 2015, at 13:24, Hans van den Bogert hansbog...@gmail.com wrote:
It's self-built; I can't do otherwise, as I can't install packages on the cluster
here.
The problem seems to be with libtool. When compiling Mesos on a host with apr-devel
and
I have ThriftServer2 up and running; however, I notice that it relays the query
to HiveServer2 when I pass the hive-site.xml to it.
I'm not sure if this is the expected behavior, but based on what I have up and
running, the ThriftServer2 invokes HiveServer2, which results in MapReduce or Tez
Hi Alexey and Daniel,
I'm using Spark 1.2.0 and am still getting the same error, as described below.
Do you have any news on this? I really appreciate your responses!
a Spark cluster of 1 master VM SparkV1 and 1 worker VM SparkV4 (the error
is the same if I have 2 workers). They are connected
Hello
Has Spark implemented computing statistics for Parquet files? Or is there
any other way I can enable broadcast joins between parquet file RDDs in
Spark Sql?
Thanks
Dima
No, only each group should need to fit.
On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet cjno...@gmail.com wrote:
Doesn't iter still need to fit entirely into memory?
On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra m...@clearstorydata.com
wrote:
rdd.mapPartitions { iter =>
val grouped =
See earlier thread:
http://search-hadoop.com/m/JW1q5BZhf92
On Wed, Feb 11, 2015 at 3:04 PM, Dima Zhiyanov dimazhiya...@gmail.com
wrote:
Hello
Has Spark implemented computing statistics for Parquet files? Or is there
any other way I can enable broadcast joins between parquet file RDDs in
Hi Anders,
I just tried this out and was able to successfully acquire executors. Any
strange log messages or additional color you can provide on your setup?
Does yarn-client mode work?
-Sandy
On Wed, Feb 11, 2015 at 1:28 PM, Anders Arpteg arp...@spotify.com wrote:
Hi,
Compiled the latest
Hello,
I have many questions about joins, but arguably just one,
specifically about memory and containers that are overstepping their limits, as
per errors dotted around all over the place, but something like:
rdd.mapPartitions { iter =>
  val grouped = iter.grouped(batchSize)
  for (group <- grouped) { ... }
}
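A runnable sketch of the same idea (Person, people, and batchSize are assumptions for illustration):
case class Person(name: String, age: Int)

val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 40)))
val batchSize = 30

val result = people.mapPartitions { iter =>
  // grouped is lazy: only one batch of up to batchSize elements is materialized at a time
  iter.grouped(batchSize).flatMap { group =>
    group.map(identity) // placeholder for the batch-optimized per-group work
  }
}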
On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet cjno...@gmail.com wrote:
I think the word partition here is a tad different than the term
partition that we use in Spark. Basically, I want
Yes. Next release (Spark 1.3) is coming out end of Feb / early Mar.
On Wed, Feb 11, 2015 at 7:22 AM, Jianguo Li flyingfromch...@gmail.com
wrote:
Hi,
I really like the pipeline in the spark.ml in Spark1.2 release. Will
there be more machine learning algorithms implemented for the pipeline
the runtime for each consecutive iteration is still roughly twice as long as
for the previous one -- is there a way to reduce whatever overhead is
accumulating?
On Feb 11, 2015, at 8:11 PM, Davies Liu dav...@databricks.com wrote:
On Wed, Feb 11, 2015 at 10:47 AM, rok rokros...@gmail.com
Thanks Ted.
On Feb 10, 2015, at 20:06, Ted Yu yuzhih...@gmail.com wrote:
Please take a look at:
examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaDirectKafkaWordCount.java
which was checked in yesterday.
On Sat, Feb 7, 2015 at 10:53 AM, Eduardo Costa Alfaia
I think the word partition here is a tad different than the term
partition that we use in Spark. Basically, I want something similar to
Guava's Iterables.partition [1], that is, If I have an RDD[People] and I
want to run an algorithm that can be optimized by working on 30 people at a
time, I'd
Doesn't iter still need to fit entirely into memory?
On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra m...@clearstorydata.com
wrote:
rdd.mapPartitions { iter =>
  val grouped = iter.grouped(batchSize)
  for (group <- grouped) { ... }
}
On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet
Hi,
What is the difference between SPARK_LOCAL_DIRS and SPARK_WORKER_DIR? Also,
does Spark clean these up after execution?
Regards,
Gaurav
On Wed, Feb 11, 2015 at 2:43 PM, Rok Roskar rokros...@gmail.com wrote:
the runtime for each consecutive iteration is still roughly twice as long as
for the previous one -- is there a way to reduce whatever overhead is
accumulating?
Sorry, I didn't fully understand your question; which two are
Hi,
I want to work on a use case something like the one below.
I just want to know whether something similar has already been done that can be
reused.
The idea is to use Spark for an ETL / data science / streaming pipeline.
So when data comes inside the cluster front door we will do the following steps
1)
Upload
Hey Todd,
I don't have an app to test against the thrift server; are you able to define
custom SQL without using Tableau's schema query? I guess it's not possible to
just use Spark SQL temp tables; you may have to use permanent Hive tables that
are actually in the metastore so Tableau can
Check spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala
It can be used through sliding(windowSize: Int) in
spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala
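A minimal sketch of what that looks like (sliding is a developer API, so the import below is an assumption that may change between releases):
import org.apache.spark.mllib.rdd.RDDFunctions._

val windows = sc.parallelize(1 to 10).sliding(3).collect()
// windows: Array(Array(1, 2, 3), Array(2, 3, 4), ..., Array(8, 9, 10))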
Yuhao
From: Mark Hamstra [mailto:m...@clearstorydata.com]
Sent: Thursday, February 12, 2015
Hey All,
I've been playing around with the new DataFrame and ML pipelines APIs and
am having trouble accomplishing what seems like should be a fairly basic
task.
I have a DataFrame where each column is a Double. I'd like to turn this
into a DataFrame with a features column and a label column
First, sorry for the long post. So back to Tableau and Spark SQL: I'm still
missing something.
TL;DR
To get the Spark SQL Temp table associated with the metastore are there
additional steps required beyond doing the below?
Initial SQL on connection:
create temporary table test
using
Ah, nevermind, I just saw
http://spark.apache.org/docs/1.2.0/sql-programming-guide.html (language
integrated queries) which looks quite similar to what i was thinking
about. I'll give that a whirl...
On Wed, Feb 11, 2015 at 7:40 PM, jay vyas jayunit100.apa...@gmail.com
wrote:
Hi Spark, is there anything in the works for a typesafe HQL-like API for
building Spark queries from case classes? I.e., where, given a domain
object Product with a cost associated with it, we can do something
like:
query.select(Product).filter({ _.cost < 50.00f
Hi,
Can somebody please help with how to keep Spark and Hive logs out of the
application log?
Both Spark and Hive use a log4j property file.
I have configured my log4j.properties file for my application as below, but it is
still printing Spark and Hive console logging too. Please suggest; this is urgent for
me.
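A hedged sketch of one common approach (appender and logger names assumed): raise the Spark and Hive loggers above the application's level in log4j.properties:
# keep application logging at INFO, quiet Spark and Hive
log4j.rootLogger=INFO, file
log4j.logger.org.apache.spark=WARN
log4j.logger.org.apache.hadoop.hive=WARN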
Hi,
We really have no adequate solution for this issue yet. Any available
analytical rules or hints would be appreciated.
Thanks,
Sun.
fightf...@163.com
From: fightf...@163.com
Date: 2015-02-09 11:56
To: user; dev
Subject: Re: Sort Shuffle performance issues about using AppendOnlyMap for
large data
What kind of data do you have? Kafka is a popular source to use with Spark
Streaming.
But Spark Streaming also supports reading from a file. It's called a basic source:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers
--- Original Message ---
From:
Hi there,
I am new to Spark. When training a model with k-means using the
following code, how do I obtain the cluster assignments in the next
step?
val clusters = KMeans.train(parsedData, numClusters, numIterations)
I searched around many examples but they mostly calculate the WSSSE.
I am
Hello Felix,
I am already streaming in very simple data using Kafka (few messages /
second, each record only has 3 columns...really simple, but looking to
scale once I connect everything). I am processing it in Spark Streaming and
am currently writing word counts to hdfs. So the part where I am
Yes, sorry I wasn't clear -- I still have to trigger the calculation of the RDD
at the end of each iteration. Otherwise all of the lookup tables are shipped to
the cluster at the same time, resulting in memory errors. Therefore this becomes
several map jobs instead of one, and each consecutive
Thank you!
The Hive solution seemed more like a workaround. I was wondering whether native
Spark SQL support for computing statistics for Parquet files would be available.
Dima
Sent from my iPhone
On Feb 11, 2015, at 3:34 PM, Ted Yu yuzhih...@gmail.com wrote:
See earlier thread:
Good, worth double-checking that's what you got. That's barely 1GB per
task though. Why run 48 if you have 24 cores?
On Wed, Feb 11, 2015 at 9:03 AM, lihu lihu...@gmail.com wrote:
I gave 50GB to the executor, so it seems there is no reason for the memory
to be insufficient.
On Wed, Feb 11, 2015
I just want to make the best use of the CPUs and test the performance of Spark
when there are a lot of tasks on a single node.
On Wed, Feb 11, 2015 at 5:29 PM, Sean Owen so...@cloudera.com wrote:
Good, worth double-checking that's what you got. That's barely 1GB per
task though. Why run 48 if you
One additional comment I would make is that you should be careful with
updates in Cassandra. It does support them, but large numbers of updates
(i.e., changing existing keys) tend to cause fragmentation. If you are
(mostly) adding new keys (e.g., new records in the time series) then
Cassandra can
I forgot to mention that if you do decide to use Cassandra, I'd highly
recommend jumping on the Cassandra mailing list; if we had taken in some of
the advice on that list, things would have been considerably smoother.
cheers
On Wed, Feb 11, 2015 at 8:12 PM, Christian Betz
Hi
Regarding the Cassandra data model, there's an excellent post on the eBay tech
blog:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/.
There's also a slideshare for this somewhere.
Happy hacking
Chris
From: Franc Carter
Aris, if you encountered a bug, it's best to raise an issue with the
es-hadoop/spark project, namely here [1].
When using SparkSQL the underlying data needs to be present - this is mentioned
in the docs as well [2]. As for the order,
that does look like a bug and shouldn't occur. Note the
Hi,
I have a row matrix x
scala x
res3: org.apache.spark.mllib.linalg.distributed.RowMatrix =
org.apache.spark.mllib.linalg.distributed.RowMatrix@63949747
and I would like to apply a function to each element of this matrix. I was
looking for something like:
x map (e => exp(-e*e))
How can I do
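One possible way to express this, as a sketch (assuming x: RowMatrix): map over the underlying row RDD and rebuild the matrix:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// apply exp(-e*e) elementwise to every row vector
val y = new RowMatrix(x.rows.map { v =>
  Vectors.dense(v.toArray.map(e => math.exp(-e * e)))
})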
Hi Arush,
So yes, I want to create the tables through Spark SQL. I have placed the
hive-site.xml file inside the $SPARK_HOME/conf directory; I thought that
was all I should need to do to have the thriftserver use it. Perhaps my
hive-site.xml is wrong; it currently looks like this:
Hi Aplysia,
Thanks for the reply.
Could you be more specific about which part of the document to look at,
as I have already seen it and tried a few of the relevant settings to no
avail.
Hi guys~
Comparing these two architectures, why does BDAS put YARN and Mesos under HDFS?
Do you have any special consideration, or is it just an easy way to express the AMPLab stack?
Best regards!
That kinda dodges the problem by ignoring generic types. But it may be
simpler than the 'real' solution, which is a bit ugly.
(But first, to double check, are you importing the correct
TextOutputFormat? there are two versions. You use .mapred. with the
old API and .mapreduce. with the new API.)
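A minimal sketch with the new API (the pairs RDD and output path are made up):
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat // new API (.mapreduce.)

val pairs = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
pairs.saveAsNewAPIHadoopFile[TextOutputFormat[String, String]]("hdfs://out/text")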
Hi,
I run k-means (MLlib) on a cluster with 12 workers. Every worker has
128G RAM and 24 cores. I run 48 tasks on one machine. The total data is just
40GB.
When the dimension of the data set is about 10^7, for every task the
duration is about 30s, but the cost of GC is about 20s.
When I
Hello guys,
I am trying to run a Random Forest on 30MB of data. I have a cluster of 4
machines. Each machine has 106 MB of RAM and 16 cores.
I am getting:
15/02/11 11:01:23 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkDriver-akka.actor.default-dispatcher-3] shutting down
Hi, I want to include Kryo serialization in a project if possible, and
first I'm trying to run FlumeEventCount with Kryo. If I comment out the setAll
method, it runs correctly, but if I use the Kryo params it returns several errors.
15/02/11 11:42:16 ERROR SparkDeploySchedulerBackend: Asked to remove
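A minimal sketch of switching the serializer (the app name is illustrative):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("FlumeEventCount")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// custom classes can then be registered via conf.registerKryoClasses(Array(classOf[...]))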
Hi Spark users,
I seem to be having this consistent error which I have been trying to reproduce
and narrow down the problem. I've been running a PySpark application on Spark
1.2 reading avro files from Hadoop. I was consistently seeing the following
error:
py4j.protocol.Py4JJavaError: An
I want to create/access Hive tables from Spark.
I have placed the hive-site.xml inside the spark/conf directory. Even
so, it creates a local metastore in the directory where I run the spark
shell and exits with an error.
I am getting this error when I try to create a new hive table. Even
I also forgot some other information. I have made this error go away by making
my pyspark application use spark-1.1.1-bin-cdh4 for the driver, but communicate
with a spark 1.2 master and worker. It's not a good workaround, so I would like
to have the driver also be spark 1.2
Michael
Dear all,
I am new to Spark SQL and have no experience with Hive.
I tried to use the built-in Hive function to extract the hour from a
timestamp in Spark SQL, but got: java.util.NoSuchElementException: key
not found: hour
How should I extract the hour from a timestamp?
And I am very confused about
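A hedged sketch of one way this is commonly done (it assumes a HiveContext, since the plain SQLContext in 1.2 does not ship Hive UDFs such as hour; the events table is made up):
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
// note the backticks: timestamp is a reserved word in the parser
hiveCtx.sql("SELECT hour(`timestamp`) AS h FROM events").collect()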
As far as I can tell from my tests, language-integrated query in Spark isn't type safe, i.e.,
query.where('cost == foo)
Would compile and return nothing.
If you want type safety, perhaps you want to map the SchemaRDD to an RDD of
Product (your type, not scala.Product).
--- Original Message ---
From: jay
Thanks everyone for your responses. I'll definitely think carefully about
the data models, querying patterns and fragmentation side-effects.
Cheers, Mike.
On Wed, Feb 11, 2015 at 1:14 AM, Franc Carter franc.car...@rozettatech.com
wrote:
I forgot to mention that if you do decide to use
It sounds like you probably want to do a standard Spark map, that results
in a tuple with the structure you are looking for. You can then just
assign names to turn it back into a dataframe.
Assuming the first column is your label and the rest are features you can
do something like this:
val df
I think there is a minor error here in that the first example needs a
tail after the seq:
df.map { row =>
  (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double]))
}.toDataFrame("label", "features")
On Wed, Feb 11, 2015 at 7:46 PM, Michael Armbrust
mich...@databricks.com wrote:
It sounds like
I was able to resolve this use case (thanks, Cheng Lian), where I wanted to
launch executors on just the specific partition while also getting the batch
pruning optimisations of Spark SQL, by doing the following:
val query = sql("SELECT * FROM cachedTable WHERE key = 1")
val plannedRDD =
Try increasing the value of spark.yarn.executor.memoryOverhead. Its default
value is 384MB in Spark 1.1. This error generally comes up when your process usage
exceeds your max allocation. Use the following property to increase the memory overhead.
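One way to set it programmatically (the value is illustrative):
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.yarn.executor.memoryOverhead", "1024")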
From: Yifan LI
Just read the thread "Are these numbers abnormal for spark streaming?" and
I think I am seeing similar results - that is, increasing the window seems
to be the trick here. I will have to monitor for a few hours/days before I
can conclude (there are so many knobs/dials).
On Wed, Feb 11, 2015 at
Did you have a look at
http://spark.apache.org/docs/1.2.0/building-spark.html
I think you can simply download the source and build for your hadoop
version as:
mvn -Dhadoop.version=2.0.0-mr1-cdh4.7.0 -DskipTests clean package
Thanks
Best Regards
On Thu, Feb 12, 2015 at 11:45 AM, Michael
On Spark 1.2 (have been seeing this behaviour since 1.0), I have a
streaming app that consumes data from Kafka and writes it back to Kafka
(different topic). My big problem has been Total Delay. While execution
time is usually window size (in seconds), the total delay ranges from a
minutes to
Hi Zhan,
Yes, I found there is an hdfs account, which was created by Ambari, but what's
the password for this account? How can I log in under this account?
Can I just change the password for the hdfs account?
Regards,
-- Original --
From: Zhan