Hi all,
I have obtained a Matrix result in Java, but I don't know how to save
the result to a file:
"Matrix cov = mat.computeCovariance();"
THX.
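A minimal sketch of one way to do it (saveMatrix is an illustrative helper, written in Scala): computeCovariance returns a local Matrix held in driver memory, so plain file I/O suffices.

import java.io.PrintWriter
import org.apache.spark.mllib.linalg.Matrix

// Writes each row of the local Matrix as a comma-separated line.
def saveMatrix(m: Matrix, path: String): Unit = {
  val out = new PrintWriter(path)
  try {
    for (i <- 0 until m.numRows) {
      out.println((0 until m.numCols).map(j => m(i, j)).mkString(","))
    }
  } finally out.close()
}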
Thank you guys for the answers.
@Ted Yu: You are right, in general the code to fetch stuff externally
should be called separately, while Spark should only access the data
written by these two services via flume/kafka/whatever. However, before I
get there, I would like to have the Spark job ready.
Hi Andy,
The equation to calculate IDF is:
idf = log((m + 1) / (d(t) + 1))
where m is the total number of documents and d(t) is the number of documents
containing term t. You can refer here:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala#L150
The equation to calculate TF-IDF is:
TFIDF = TF * IDF
You can refer to:
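As a concrete illustration (not from the original thread), a minimal sketch of computing TF-IDF with MLlib; the input path and whitespace tokenization are assumptions:

import org.apache.spark.mllib.feature.{HashingTF, IDF}

val documents = sc.textFile("docs.txt").map(_.split(" ").toSeq) // hypothetical corpus
val tf = new HashingTF().transform(documents) // term-frequency vectors
tf.cache()                                    // read twice: once by fit, once by transform
val idfModel = new IDF().fit(tf)              // computes idf = log((m + 1) / (d(t) + 1))
val tfidf = idfModel.transform(tf)            // element-wise TF * IDF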
How to set different executor memory limits for different worker nodes?
I'm using Spark 1.5.2 in standalone deployment mode, launching via scripts. The
executor memory is set via 'spark.executor.memory' in conf/spark-defaults.conf,
which sets the same memory limit for all the worker nodes. I
Hi,
I have a SPARK table (created from hiveContext) with couple of hundred
partitions and few thousand files.
When I run a query on the table, Spark spends a lot of time (as seen in
the pyspark output) collecting the list of files from the several partitions.
Only after this does the query start running.
Is
Hi Spark users,
when I want to map the result of count on groupBy, I need to convert the
result to a DataFrame, then change the column names and map the result to a new
case class. Why doesn't the Spark Dataset API have this functionality directly?
case class LogRow(id: String, location: String, time: Long)
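A minimal sketch of the workaround described above, assuming Spark 1.6 with a SQLContext in scope; LocationCount and the JSON input are illustrative:

import sqlContext.implicits._

case class LocationCount(location: String, count: Long)

val logs = sqlContext.read.json("logs.json").as[LogRow] // hypothetical input
val counts = logs.toDF()              // drop to the untyped DataFrame API
  .groupBy("location").count()        // yields columns "location" and "count"
  .as[LocationCount]                  // map back to a typed case class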
Hi
Do I need to install Spark on all the YARN cluster nodes if I want to submit
a job in yarn-client mode?
Is there any way to spawn Spark job executors on
cluster nodes where I have not installed Spark?
Thanks
Sanjeev
Hi, Mich,
Building Spark with the SparkR profile enabled requires R to be installed on your
build machine.
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, January 19, 2016 5:27 AM
To: Mich Talebzadeh
Cc: user @spark
Subject: Re: building spark 1.6 throws error Rscript: command not found
Did you try SparkContext.getOrCreate()?
You don't need to pass the SparkContext to the map function; you can
retrieve it from the SparkContext singleton.
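A minimal sketch of the suggestion:

import org.apache.spark.SparkContext

// Returns the active SparkContext if one exists, otherwise creates one.
val sc = SparkContext.getOrCreate()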
Regards,
Ricardo
On Mon, Jan 18, 2016 at 6:29 PM, gpatcham [via Apache Spark User List] <
ml-node+s1001560n25998...@n3.nabble.com> wrote:
Hi,
Certain APIs (map, mapValues) give the developer access to the data stored
in RDDs.
Am I correct in saying that these APIs must never modify the data, but
always return a new object with a copy of the data if the data needs to be
updated for the returned RDD?
Thanks,
Dave.
Do you have any plans for supporting Hive transactions in Spark?
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Tuesday, January 19, 2016 3:18 PM
To: hnagar
Cc: user
Subject: Re: Spark SQL -Hive transactions support
We
Hi, I have one question: in spark-env.sh, should I specify all masters in the
SPARK_MASTER_IP parameter? I've already set SPARK_DAEMON_JAVA_OPTS with the
ZooKeeper configuration as specified in the Spark documentation.
Thanks & Regards
Raghvendra
On Wed, Jan 20, 2016 at 1:46 AM, Raghvendra Singh <
Hi Tim
Do you have any materials/blog posts for running Spark in a container in a Mesos
cluster environment? I have googled it but couldn't find info on it. The Spark
documentation says it is possible, but no details are provided. Please help.
Thanks
Sathish
On Mon, Sep 21, 2015 at 11:54 AM Tim Chen
You could try increasing the driver memory via "--driver-memory"; it looks like
the OOM came from the driver side, so the simple solution is to increase the
driver's memory.
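For example (memory size, class, and jar name are illustrative):

spark-submit --driver-memory 8g --class com.example.App app.jar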
On Tue, Jan 19, 2016 at 1:15 PM, Julio Antonio Soto wrote:
> Hi,
>
> I'm having trouble when uploading spark
thanks Daniel, this will certainly help,
regards, Richard
On Tue, Jan 19, 2016 at 6:35 PM, Daniel Imberman
wrote:
> edit 2: filter should be map
>
> val numColumns = separatedInputStrings.map{ case(id, (stateList,
> numStates)) => numStates}.reduce(math.max)
>
> On
Hi,
I'm having trouble when uploading Spark jobs in yarn-cluster mode. While the
job works and completes in yarn-client mode, I hit the following error when
using spark-submit in yarn-cluster (simplified):
16/01/19 21:43:31 INFO hive.metastore: Connected to metastore.
16/01/19 21:43:32 WARN
Hello,
I was wondering if there exists some elegant way to build a fully connected
grid graph.
The standard grid-graph function only creates one where a vertex is
connected to the vertices at row+1 and column+1. My algorithm needs
every vertex to be connected to the vertices at row-1,
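Not a standard utility, but a minimal sketch of building such a graph by hand with GraphX (dimensions and the edge attribute are illustrative), connecting every vertex to its neighbours in all four directions:

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}

def fullGridGraph(sc: SparkContext, rows: Int, cols: Int): Graph[(Int, Int), Double] = {
  def id(r: Int, c: Int): Long = r.toLong * cols + c
  val vertices = sc.parallelize(
    for (r <- 0 until rows; c <- 0 until cols) yield (id(r, c), (r, c)))
  val edges = sc.parallelize(
    for {
      r <- 0 until rows; c <- 0 until cols
      (dr, dc) <- Seq((-1, 0), (1, 0), (0, -1), (0, 1)) // row-1, row+1, col-1, col+1
      if r + dr >= 0 && r + dr < rows && c + dc >= 0 && c + dc < cols
    } yield Edge(id(r, c), id(r + dr, c + dc), 1.0))
  Graph(vertices, edges)
}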
Hi Sathish,
Sorry about that. I think that's a good idea, and I'll write up a section in
the Spark documentation page to explain how it can work. We (Mesosphere)
have been doing this for our DCOS Spark for our past releases, and it has been
working well so far.
Thanks!
Tim
On Tue, Jan 19, 2016 at
I have a set of log files I would like to read into an RDD. These files
are all compressed .gz, and the filenames are date stamped. The source
of these files is the page view statistics data for Wikipedia:
http://dumps.wikimedia.org/other/pagecounts-raw/
The file names look like this:
Hive's ACID feature (which introduces transactions) is not required for
inserts, only updates and deletes. Inserts should be supported on a vanilla
Hive shell. I'm not sure how Spark interacts with Hive in that regard but
perhaps the HiveSQLContext implementation is lacking support.
On a separate
Hi,
Which DataFrame API can access Hive complex types (struct, array, map)?
Thanks,
Patcharee
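One answer, as a minimal sketch (table and column names are illustrative, assuming a HiveContext-backed sqlContext in Spark 1.x): the Column API exposes struct fields, array elements, and map values.

val df = sqlContext.table("events")  // hypothetical Hive table
df.select(
  df("meta").getField("author"),     // field of a struct column
  df("tags").getItem(0),             // element of an array column
  df("attrs").getItem("color")       // value of a map column
).show()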
Hive has supported transactions since version 0.14.
I am using Spark 1.6 and Hive 1.2.1; are transactions supported in Spark
SQL now? I tried in the spark-shell and it gives the following error:
org.apache.spark.sql.AnalysisException:
Unsupported language features in query: insert into test
We don't support Hive-style transactions.
On Tue, Jan 19, 2016 at 11:32 AM, hnagar wrote:
> Hive has transactions support since version 0.14.
>
> I am using Spark 1.6, and Hive 1.2.1, are transactions supported in Spark
> SQL now. I tried in the Spark-Shell and it
Hi Oleg,
The Tachyon related issue should be fixed.
Hope this helps,
Calvin
On Mon, Jan 18, 2016 at 2:51 AM, Oleg Ruchovets
wrote:
> Hi ,
> I tried to follow the Spark 1.6.0 instructions to install Spark on EC2.
>
> It doesn't work properly - got exceptions and at the end
Here's the complete master log on reproducing the error
http://pastebin.com/2YJpyBiF
Regards
Raghvendra
On Wed, Jan 20, 2016 at 12:38 AM, Raghvendra Singh <
raghvendra.ii...@gmail.com> wrote:
> OK, I will try to reproduce the problem. Also, I don't think this is an
> uncommon problem. I am
In Spark 2.0 we are planning to combine DataFrame and Dataset so that all
the methods will be available on either class.
On Tue, Jan 19, 2016 at 3:42 AM, Milad khajavi wrote:
> Hi Spark users,
>
> when I want to map the result of count on groupBy, I need to convert the
>
The occasional type error if the casting goes wrong for whatever reason.
2016-01-19 1:22 GMT+08:00 Michael Armbrust :
> What error?
>
> On Mon, Jan 18, 2016 at 9:01 AM, Simon Hafner wrote:
>>
>> And for deserializing,
>>
On 18 Jan 2016, at 23:39, 李振 wrote:
: java.io.IOException: java.net.ConnectException: Connection refused
at
org.apache.hadoop.crypto.key.kms.KMSClientProvider.addDelegationTokens(KMSClientProvider.java:888)
at
You might need hive-site.xml
From: Peter Zhang
Sent: Monday, January 18, 2016 9:08 PM
Subject: Re: SparkR with Hive integration
To: Jeff Zhang
Cc:
Thanks,
I will try.
So is the logging to Cassandra being done via Spark?
On Wed, Jan 13, 2016 at 7:17 AM, Dennis Birkholz
wrote:
> Hi together,
>
> we use Cassandra to log event data and process it every 15 minutes with Spark.
> We are using the Cassandra Java Connector for Spark.
>
> Randomly
It is not scanning HBase. What it is doing is looping through your list
of row keys and fetching data for each one at a time.
Ex: Your Solr result has 5 records, with row keys R1...R5.
Then the list will be [R1, R2, ..., R5]
Then table.get(list) will do something like:
res = []
for k in list:
    v = table.get(k)
    res.append(v)
get(List gets) will call:
Object [] r1 = batch((List)gets);
where batch() would do:
AsyncRequestFuture ars = multiAp.submitAll(pool, tableName, actions,
null, results);
ars.waitUntilDone();
multiAp is an AsyncProcess.
In short, the client would access the region servers for the results.
Hi,
I tried with --driver-memory 16G (more than enough to read a simple parquet
table), but the problem still persists.
Everything works fine in yarn-client.
--
Julio Antonio Soto de Vicente
> On 19 Jan 2016, at 22:18, Saisai Shao wrote:
>
> You could try
Thank you! Looking forward to it.
On Tue, Jan 19, 2016 at 4:03 PM Tim Chen wrote:
> Hi Sathish,
>
> Sorry about that, I think that's a good idea and I'll write up a section
> in the Spark documentation page to explain how it can work. We (Mesosphere)
> have been doing
I also would be interested in some best practice for making this work.
Where will the writeup be posted? On the Mesosphere website?
Sent from my Verizon Wireless 4G LTE smartphone
Original message
From: Sathish Kumaran Vairavelu
Date: 01/19/2016
Hi,
I want to load really large-volume datasets from MySQL using the Spark DataFrame
API, and then save them as Parquet or ORC files to use with Hive/Impala. The
dataset is about 1 billion records, and
when I use the following naive code to run that, an error occurs and
Hi
I am trying to integrate HBase, Solr, and Spark.
Solr indexes all the documents from HBase through hbase-indexer.
Through Spark I am manipulating all the datasets. The thing is, after getting
the Solr documents from the Solr query, they have the row keys and row values. So
directly I got the row keys and
Thanks Ted, :)
if everything gets indexed from HBase into Solr, then there is no need to trace
the region servers once again
Thanks
Beesh
On Wed, Jan 20, 2016 at 5:05 AM, Ted Yu wrote:
> get(List gets) will call:
>
> Object [] r1 = batch((List)gets);
>
> where batch() would
I want to execute a program in a distributed environment without using Hadoop,
only a Spark cluster. What is the best way to do this?
You can use sc.newAPIHadoopFile and pass your own InputFormat and
RecordReader to read the compressed .gz files for your use case. For
a start, you can look at the:
- wholeTextFiles implementation
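A minimal sketch of both routes (paths are illustrative). Note that plain textFile already decompresses .gz through the configured Hadoop codecs, and glob patterns can select the date-stamped files:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Simple route: textFile handles .gz transparently.
val views = sc.textFile("/data/pagecounts/pagecounts-20160119-*.gz")

// Explicit route: supply the InputFormat and key/value classes yourself,
// swapping in a custom InputFormat/RecordReader where needed.
val lines = sc.newAPIHadoopFile(
    "/data/pagecounts/*.gz",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, line) => line.toString }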
It seems Spark is not able to serialize your function code to the worker nodes.
I have tried to put a solution into a simple set of commands. Maybe you can
combine the last four lines into a function.
val arr = Array((1,"A","<20","0"), (1,"A",">20 & <40","1"), (1,"B",">20 &
<40","0"), (1,"C",">20 & <40","0"),
If you are processing a file, then you can keep the same file in all
machines in the same location and everything should work.
Thanks
Best Regards
On Wed, Jan 20, 2016 at 11:15 AM, Kamaruddin wrote:
> I want to execute a program in a distributed environment without
Hi,
Just a thought: can we use Spark JobServer and trigger jobs through REST
APIs? In this case, all jobs will share the same context and run
in parallel.
If any one has other thoughts please share
Regards,
Rajesh
On Tue, Jan 19, 2016 at 10:28 PM, emlyn wrote:
> We
Hi,
Thanks a lot for your suggestion. I then tried the following code:
val prop = new java.util.Properties
prop.setProperty("user","test")
prop.setProperty("password", "test")
prop.setProperty("partitionColumn", "added_year")
prop.setProperty("lowerBound", "1985")
Hi,
I suggest you partition the JDBC read on an indexed column of the MySQL
table.
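A minimal sketch of a partitioned JDBC read (connection details and bounds are illustrative):

val df = sqlContext.read.jdbc(
  url = "jdbc:mysql://host:3306/db",
  table = "big_table",
  columnName = "added_year",   // indexed column to partition on
  lowerBound = 1985L,
  upperBound = 2016L,
  numPartitions = 32,
  connectionProperties = prop)
df.write.format("orc").save("/warehouse/big_table")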
2016-01-20 10:11 GMT+08:00 fightf...@163.com :
> Hi ,
> I want to load really large-volume datasets from mysql using spark
> dataframe api. And then save as
> parquet file or orc file to
Hi all,
I am joining two tables with common columns using a full outer join. However,
the current DataFrame API doesn't support natural joins, so the output
contains redundant common columns from both of the tables.
Is there any way to remove these redundant columns for a "natural" full
outer join?
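One workaround, as a minimal sketch (assuming Spark 1.6, where join accepts a Seq of column names; left, right, and the column names are illustrative):

// "id" and "day" appear once in the output instead of twice.
val joined = left.join(right, Seq("id", "day"), "outer")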
Did you try re-partitioning the data before doing the write?
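For example, as a minimal sketch (df, partition count, and path are illustrative):

// Spread the hourly batch across more partitions before writing Parquet.
df.repartition(48).write.parquet("/nfs/output/hourly")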
Thanks
Best Regards
On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov <
pavel.plotni...@team.wrike.com> wrote:
> Hello,
> I'm using spark on some machines in standalone mode, data storage is
> mounted on these machines via NFS. I have
Hi John,
I recently deployed Redis instances using the Kubernetes framework on Apache Mesos.
Kubernetes uses the pod concept; you can run your workloads (Redis/Spark)
as Docker containers, and it also adds some HA features to the instances.
Cheers
Nagaraj C
From: Darren Govoni
I have to run this to install the prerequisites to get the jekyll build to work; you do
need the python pygments package (I'm on Ubuntu):
sudo apt-get install ruby ruby-dev make gcc nodejs
sudo gem install jekyll --no-rdoc --no-ri
sudo gem install jekyll-redirect-from
sudo apt-get install
It depends on what you mean by "write access". The RDDs are immutable, so
you can't really change them. When you apply a mapping/filter/groupBy
function, you are creating a new RDD starting from the original one.
Kind regards,
Marco
2016-01-19 13:27 GMT+01:00 Dave :
>
Hi,
I am working with Spark in Java on top of an HDFS cluster. In my code, two
RDDs are partitioned with the same partitioner (HashPartitioner with the
same number of partitions), so they are co-partitioned.
Thus the same keys end up in the same partition numbers, but that does not mean
that both RDDs are
It's a good question. You can easily imagine an RDD of classes that
are mutable. Yes, if you modify these objects, the result is pretty
undefined, so don't do that.
On Tue, Jan 19, 2016 at 12:27 PM, Dave wrote:
> Hi Marco,
>
> Yes, that answers my question. I just wanted
Hello,
I'm using Spark on some machines in standalone mode; data storage is
mounted on these machines via NFS. I have an input data stream, and when I
try to store each hour's data in Parquet, the job executes mostly on one
core and the hourly data takes 40-50 minutes to store. It is very slow!
Thanks Sean.
On 19/01/16 13:36, Sean Owen wrote:
It's a good question. You can easily imagine an RDD of classes that
are mutable. Yes, if you modify these objects, the result is pretty
undefined, so don't do that.
On Tue, Jan 19, 2016 at 12:27 PM, Dave wrote:
Hi
Hello,
RDDs are immutable by design. The reasons, to quote Sean Owen in this answer
(https://www.quora.com/Why-is-a-spark-RDD-immutable), are the following:
> Immutability rules out a big set of potential problems due to updates from
> multiple threads at once. Immutable data is definitely safe
Can you give me more details on Spark's JobServer?
Regards,
Praveen
On 18 Jan 2016 03:30, "Jia" wrote:
> I guess all jobs submitted through JobServer are executed in the same JVM,
> so RDDs cached by one job can be visible to all other jobs executed later.
> On Jan 17,
Hi Marco,
Yes, that answers my question. I just wanted to be sure, as the API gave
me write access to the immutable data, which means it's up to the
developer to know not to modify the input parameters for these APIs.
Thanks for the response.
Dave.
On 19/01/16 12:25, Marco wrote:
Hello,
RDD
Is cacheTable similar to asTempTable before?
Sent from my iPhone
> On 19 Jan, 2016, at 4:18 am, George Sigletos wrote:
>
> Thanks Kevin for your reply.
>
> I was suspecting the same thing as well, although it still does not make much
> sense to me why would you need
Hi, Praveen, have you checked out this, which might have the details you need:
https://spark-summit.org/2014/wp-content/uploads/2014/07/Spark-Job-Server-Easy-Spark-Job-Management-Chan-Chu.pdf
Best Regards,
Jia
On Jan 19, 2016, at 7:28 AM, praveen S wrote:
> Can you give
The most efficient way to determine the number of columns would be to do a
take(1) and split in the driver.
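As a minimal sketch (assuming an RDD[String] with comma-separated fields):

// Pull a single row to the driver and count its fields.
val numColumns = rdd.take(1).headOption.map(_.split(",").length).getOrElse(0)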
Regards
Sab
On 19-Jan-2016 8:48 pm, "Richard Siebeling" wrote:
> Hi,
>
> what is the most efficient way to split columns and know how many columns
> are created.
>
>
Dear all,
Can I configure Spark on multiple nodes without HDFS, so that output data
will be written to the local file system on each node?
I guess there is no such feature in Spark, but just want to confirm.
Best Regards,
Jia
Hi,
Can we create dummy variables for categorical variables in SparkR, like we
do using the "dummies" package in R?
--
Warm regards,
Devesh.
Hi,
What is the most efficient way to split columns and know how many columns
are created?
Here is the current RDD:
-------------------
ID   STATE
-------------------
1    TX, NY, FL
2    CA, OH
-------------------
This is the preferred output:
-------------------
ID   STATE_1   STATE_2
Hi,
I'm using Spark in standalone mode without HDFS, and a shared folder is
mounted on the nodes via NFS. It looks like each node writes data as it would
to a local file system.
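As a minimal sketch (path illustrative), output can target the NFS-mounted folder with an explicit file:// URI, and each executor writes its partitions there:

rdd.saveAsTextFile("file:///mnt/shared/spark-out")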
Regards,
Pavel
On Tue, Jan 19, 2016 at 5:39 PM Jia Zou wrote:
> Dear all,
>
> Can I configure Spark on
Hi,
No, you don't need to.
However, when submitting jobs, certain resources will be uploaded to
HDFS, which could be a performance issue.
Read the log and you will understand:
15/12/29 11:10:06 INFO Client: Uploading resource
file:/data/spark/spark152/lib/spark-assembly-1.5.2-hadoop2.6.0.jar -> hdfs
Hi Richard,
If I understand the question correctly, it sounds like you could probably do
this using mapValues. (I'm assuming that you want two pieces of information
out of all rows: the states as individual items, and the number of states
in the row.)
val separatedInputStrings = input.mapValues { states => // input assumed RDD[(Int, String)]
  val stateList = states.split(",").map(_.trim)
  (stateList, stateList.length)
}
edit: Mistake in the second code example
val numColumns = separatedInputStrings.filter{ case(id, (stateList,
numStates)) => numStates}.reduce(math.max)
On Tue, Jan 19, 2016 at 8:17 AM Daniel Imberman
wrote:
> Hi Richard,
>
> If I understand the question correctly it
Yes, you can share RDDs with Tachyon, while keeping the data in memory.
Spark jobs can write to a Tachyon path (tachyon://host:port/path/) and
other jobs can read from the same path.
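As a minimal sketch (host, port, and path are illustrative):

// One job writes to Tachyon...
rdd.saveAsTextFile("tachyon://master:19998/shared/pageviews")
// ...and a later job reads the same path, served from Tachyon memory.
val shared = sc.textFile("tachyon://master:19998/shared/pageviews")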
Here is a presentation that includes that use case:
That's true, and that's the way we're doing it now, but then we're only using
the first row to determine the number of split columns.
It could be that in the second (or last) row there are 10 new columns, and
we'd like to know that too.
Probably a reduceBy operator can be used to do that, but I'm
Hi,
I have the following pair RDD created in Java.
JavaPairRDD progRef =
sc.textFile(programReferenceDataFile, 12).filter(
(String s) -> !s.startsWith("#")).mapToPair(
(String s) -> {
We have a Spark application that runs a number of ETL jobs, writing the
outputs to Redshift (using databricks/spark-redshift). This is triggered by
calling DataFrame.write.save on the different DataFrames one after another.
I noticed that during the Redshift load while the output of one job is
Yes, you can use the RFormula library. Please see
https://databricks.com/blog/2015/10/05/generalized-linear-models-in-sparkr-and-r-formula-support-in-mllib.html
On Tue, Jan 19, 2016 at 10:34 AM, Devesh Raj Singh
wrote:
> Hi,
>
> Can we create dummy variables for categorical
edit 2: filter should be map
val numColumns = separatedInputStrings.map{ case(id, (stateList,
numStates)) => numStates}.reduce(math.max)
On Tue, Jan 19, 2016 at 8:19 AM Daniel Imberman
wrote:
> edit: Mistake in the second code example
>
> val numColumns =