You could try flatMapping, i.e. if you have data: RDD[(key, Iterable[Vector])], then data.flatMap(_._2): RDD[Vector], which can be GMMed.
If you want to first partition by url, I would create multiple RDDs using `filter`, then run GMM on each of the filtered RDDs.
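A minimal sketch of both approaches (assuming MLlib's GaussianMixture, Spark 1.3+; the RDD `data`, its key type and k = 2 are placeholders):

import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val data: RDD[(String, Iterable[Vector])] = ??? // e.g. keyed by url

// (a) one GMM over all vectors, ignoring the keys
val allVectors: RDD[Vector] = data.flatMap(_._2)
val gmmAll = new GaussianMixture().setK(2).run(allVectors)

// (b) one GMM per key, filtering the pair RDD first
val models = data.keys.distinct().collect().map { k =>
  k -> new GaussianMixture().setK(2).run(data.filter(_._1 == k).flatMap(_._2))
}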
On Tue, Aug 11, 2015
I am also trying to understand how files are named when writing to Hadoop.
For example, how does the saveAs method ensure that each executor generates
unique files?
On Tue, Aug 11, 2015 at 4:21 PM, ayan guha guha.a...@gmail.com wrote:
partitioning - by itself - is a property of RDD. so essentially
Yan:
Where can I find performance numbers for Astro (it's close to middle of
August) ?
Cheers
On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:
Finally I can take a look at HBASE-14181 now. Unfortunately there is no
design doc mentioned. Superficially it is very
Hi Francis,
From my observation when using Spark SQL, dataframe.limit(n) does not
necessarily return the same result each time an app runs.
To be more precise, within one app the result should be the same for the same n;
however, changing n might not produce the same prefix (the result for n =
After changing to '--deploy-mode client' the program seems to work;
however, it looks like there is a bug in Spark when using --deploy-mode as
'yarn'. Should I open a bug?
On Tue, Aug 11, 2015 at 3:02 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I see the following line in the log 15/08/11
How does partitioning in Spark work when it comes to streaming? What's the
best way to partition time-series data grouped by a certain tag, like
categories of product (video, music, etc.)?
Refreshing a table only works for Spark SQL data sources in my understanding;
apparently here, you're querying a Hive table.
Can you try to create a table like:
CREATE TEMPORARY TABLE parquetTable (a int, b string)
USING org.apache.spark.sql.parquet.DefaultSource
hi community,
I have built a Spark and a Flink k-means application.
My test case is clustering 1 million points on a 3-node cluster.
When memory becomes the bottleneck, Flink begins to spill to disk and works
slowly, but it works.
Spark, however, loses executors when memory is full and starts again
(infinitely
To be part of a strongly connected component, every vertex must be reachable
from every other vertex. Vertex 6 is not reachable from the other vertices
of SCC 0. The same goes for 7. So both 6 and 7 form their own strongly connected
components. 6 and 7 are part of the connected components of 0 and 3
Hello all,
As a quick follow-up to this: I have been using Spark on YARN till now and
am currently exploring Mesos and Marathon. Using YARN, we could tell the
Spark job the number of executors and the number of cores as well; is
there a way to do that on Mesos? I'm using Spark 1.4.1 on Mesos
Pa:
Can you try 1.5.0 SNAPSHOT ?
See SPARK-7075 Project Tungsten (Spark 1.5 Phase 1)
Cheers
On Tue, Aug 11, 2015 at 12:49 AM, jun kit...@126.com wrote:
Could you share the details of your log file?
At 2015-08-10 22:02:16, Pa Rö paul.roewer1...@googlemail.com wrote:
hi community,
i have build a spark and
You can do something like this:
val fStream = ssc.textFileStream("/sigmoid/data/")
  .map(x => {
    try {
      // Move all the transformations within a try..catch,
      // returning the transformed value
      x
    } catch {
      case e: Exception => { logError("Whoops!!"); null }
    }
  })
Thanks
Best Regards
On Mon, Aug 10, 2015 at 7:44 PM, Mario Pastorelli
Hi –
I'd like to follow up on this, as I am running into very similar issues (with a
much bigger data set, though – 10^5 nodes, 10^7 edges).
So let me repost the question: Any ideas on how to estimate graphx memory
requirements?
Cheers!
From: Roman Sokolov [mailto:ole...@gmail.com]
Sent:
Hi Tim,
Spark on YARN allows us to do it using the --num-executors and --executor-cores
command-line arguments. I just got a chance to look at a similar Spark user
list mail, but no answer yet. So does mesos allow setting the number of
executors and cores? Is there a default number it assumes?
On
Hi,
I have been trying to use Spark for the processing I need to do on some
logs, and I have found several difficulties during the process. Most of
them I could overcome, but I am really stuck on the last one.
I would really like to know how spark is supposed to be deployed. For now,
I have
Hi,
I have successfully reduced my data and stored it in a JavaDStream<BSONObject>.
Now I want to save this data in MongoDB; for this I have used the BSONObject
type.
But when I try to save it, it throws an exception.
I also tried to save it simply with *saveAsTextFile*, but I get the same exception.
Error
Consider the spark.cores.max configuration option -- it should do what you
require.
On Tue, Aug 11, 2015 at 8:26 AM, Haripriya Ayyalasomayajula
aharipriy...@gmail.com wrote:
Hello all,
As a quick follow up for this, I have been using Spark on Yarn till now
and am currently exploring Mesos
Load data to where? To Spark? If you are referring to Spark, then there are
some differences in the way the connector is implemented. When you use Spark,
the most important thing you get is parallelism (depending on the
number of partitions). If you compare it with a native Java driver, then
My first post is here, with a log:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201508.mbox/%3ccah2_pykqhfr4tbvpbt2tdhgm+zrkcbzfnk7uedkjpdhe472...@mail.gmail.com%3E
I use Cloudera Live; I think I cannot use Spark 1.5.
I will try to run it again and post the current logfile here.
okay.
Then do you have any idea how to avoid this error?
Thanks
On Tue, Aug 11, 2015 at 12:08 AM, Tathagata Das t...@databricks.com wrote:
I think this may be expected. When the streaming context is stopped
without the SparkContext, then the receivers are stopped by the driver. The
receiver
Like this (including the filter function):
JavaPairInputDStream<LongWritable, Text> inputStream = ssc.fileStream(
    testDir.toString(),
    LongWritable.class,
    Text.class,
    TextInputFormat.class,
    new Function<Path, Boolean>() {
      @Override
      public Boolean call(Path
You can create a new issue and send a pull request for it, I think.
+ dev list
Thanks
Best Regards
On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon gurwls...@gmail.com wrote:
Dear Sir / Madam,
I have a plan to contribute some code for passing filters to a
datasource as physical
Hi,
The issue is very likely to be in the data or the transformations you
apply, rather than anything to do with the Spark Kmeans API as such. I'd
start debugging by doing a bit of exploratory analysis of the TFIDF
vectors. That is, for instance, plot the distribution (histogram) of the
TFIDF
HBase will not have a query engine.
It will provide better support for query engines.
Cheers
On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:
Ted,
I’m in China now, and seem to have difficulty accessing Apache JIRA.
Anyways, it appears to me that
I have two dataframes like this:
student_rdf = (studentid, name, ...)
student_result_rdf = (studentid, gpa, ...)
We need to join these two dataframes. We are now doing it like this:
student_rdf.join(student_result_rdf, student_result_rdf["studentid"]
== student_rdf["studentid"])
So it is simple.
There's a daily quota and a per-minute quota; you could be hitting those. You
can ask Google to increase the quota for the particular service. Now, to
stay under the limit on the Spark side, you can do a repartition to
a smaller number of partitions before doing the save. Another way is to use the local file
Hi all,
I'm using Spark 1.4.1. I create a DataFrame from a json file. There is
a column C whose values are all null in the json file. I found that
the datatype of column C in the created DataFrame is string. However,
I would like to specify the column as Long when saving it as a parquet
file. What
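A hedged sketch of one workaround (Spark 1.4): pass an explicit schema when reading the JSON so that column C is typed Long instead of being inferred as string. The paths and the extra field name are placeholders:

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("A", StringType, nullable = true), // other columns as needed
  StructField("C", LongType, nullable = true)))  // force C to Long

val df = sqlContext.read.schema(schema).json("/path/to/input.json")
df.write.parquet("/path/to/output.parquet")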
Hi,
Please let me know if I am missing anything in the command below.
Command:
dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld
--jars ///home/missingmerch/postgresql-9.4-1201.jdbc41.jar
///home/missingmerch/dse.jar
Use --verbose; it might give you some insights on what's happening.
Javier Domingo Cansino, Research & Development Engineer, +34 946545847,
Skype: javier.domingo.fon, Fon http://www.fon.com/
All information in this email is confidential http://corp.fon.com/legal/email-disclaimer
On Tue, Aug
My experience with Mesos + Spark is not great. I saw one executor with 30 CPUs
and the other executor with 6. So I don't think you can easily configure it
without some tweaking of the source code.
Sent from my iPad
On 2015-08-11, at 2:38, Haripriya Ayyalasomayajula aharipriy...@gmail.com
I have no real idea (not a Java user), but have you tried the --jars
option?
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
AFAIK, you are currently submitting the jar names as arguments to the
called class instead of the jars themselves.
Is there more information about the Spark event log? For example:
Why does the SparkListenerExecutorRemoved event not appear in the event log
when I use dynamic executor allocation?
I want to calculate the CPU elapsed time of an application based on the event log.
By the way, do you have any other method to get CPU elapsed time
Hi all,
I don't have any Hadoop FS installed in my environment, but I would like to
store dataframes in parquet files. I am failing to do so; if possible, could
anyone offer any pointers?
Thank you,
Saif
I am launching my spark-shell
spark-1.4.1-bin-hadoop2.6/bin/spark-shell
15/08/11 09:43:32 INFO SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.
scala> val data = sc.parallelize(Array(2,3,5,7,2,3,6,1)).toDF
scala> data.write.parquet("/var/data/Saif/pq")
I found some discussions online, but it all comes down to advice to use JDK 1.7 (or
1.8).
Well, I use JDK 1.7 on OS X Yosemite. Both
java -version //
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
and
echo
What does the following command say ?
mvn -version
Maybe you are using an old maven ?
Cheers
On Tue, Aug 11, 2015 at 7:55 AM, Yakubovich, Alexey
alexey.yakubov...@searshc.com wrote:
I found some discussions online, but it all comes down to advice to use JDK 1.7
(or 1.8).
Well, I use JDK 1.7 on
Hi,
Please find the log details below:
dse spark-submit --verbose --master local --class HelloWorld
etl-0.0.1-SNAPSHOT.jar --jars
file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar
file:/home/missingmerch/dse.jar
file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar
Using properties file:
Jerry,
I was able to use window functions without the hive thrift server. HiveContext
does not imply that you need the hive thrift server running.
Here’s what I used to test this out:
var conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new
I confirm that it works,
I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450
Saif
From: Ellafi, Saif A.
Sent: Tuesday, August 11, 2015 12:01 PM
To: Ellafi, Saif A.; deanwamp...@gmail.com
Cc: user@spark.apache.org
Subject: RE: Parquet without hadoop: Possible?
Sorry,
Hello everyone,
I am trying to use the PySpark API with window functions without specifying
a partition clause. I mean something equivalent to this:
SELECT v, row_number() OVER (ORDER BY v) AS rn FROM df
in SQL. I am not sure if I am doing something wrong or it is a bug, but the
results are far from what I
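For reference, a minimal Scala sketch of the same query (hedged: window functions in Spark 1.4 require a HiveContext, and an OVER clause without PARTITION BY pulls all rows into a single partition):

val df = hiveContext.createDataFrame(Seq(Tuple1(3), Tuple1(1), Tuple1(2))).toDF("v")
df.registerTempTable("df")
hiveContext.sql("SELECT v, row_number() OVER (ORDER BY v) AS rn FROM df").show()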
I forgot to mention, my setup was:
- Spark 1.4.1 running in standalone mode
- Datastax spark cassandra connector 1.4.0-M1
- Cassandra DB
- Scala version 2.10.4
From: Benjamin Ross
Sent: Tuesday, August 11, 2015 10:16 AM
To: Jerry; Michael Armbrust
Cc: user
Hi,
After taking a look at the code, I found the problem:
Spark will use broadcastNestedLoopJoin to handle a non-equality condition,
and one of my dataframes (df1) is created from an existing RDD (LogicalRDD),
so it uses defaultSizeInBytes * length to estimate its size. The other
dataframe (df2)
Hi
My Spark job (running in local[*] with Spark 1.4.1) reads data from a
Thrift server (I created an RDD; it computes the partitions in the
getPartitions() call, and in compute(), hasNext returns records from these
partitions). count() and foreach() are working fine and return the correct
number of
It should work fine. I have an example script here:
https://github.com/deanwampler/spark-workshop/blob/master/src/main/scala/sparkworkshop/SparkSQLParquet10-script.scala
(Spark 1.4.X)
What does "I am failing to do so" mean?
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
Spark evolved as an example framework for Mesos - that's how I know it. It
is surprising to see that the options provided by Mesos in this case are
fewer. Tweaking the source code, I haven't done that yet, but I would love to
see what options could be there!
On Tue, Aug 11, 2015 at 5:42 AM, Jerry Lam
Sorry, I provided bad information. This example worked fine with reduced
parallelism.
It seems my problem has to do with something specific to the real data frame
at reading time.
Saif
From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com]
Sent: Tuesday, August 11, 2015
What if processing is neither idempotent nor in a transaction, say I am
posting events to some external server after processing?
Is it possible to get the accumulator of a failed task in the retry task? Is
there any way to detect whether a task is a retried task or the original task?
I was trying to
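One hedged idea, sketched below: TaskContext (Spark 1.3+) exposes the attempt number, so a task can at least detect that it is a retry before talking to the external server. The RDD and postEvent are hypothetical placeholders, and this alone does not give exactly-once semantics:

import org.apache.spark.TaskContext

rdd.foreachPartition { records =>
  val ctx = TaskContext.get()
  if (ctx.attemptNumber == 0) {
    records.foreach(postEvent) // side effect only from the original attempt
  } else {
    // retried task: reconcile with the external server instead of re-posting
  }
}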
Michel Robert
Almaden Research Center
EDA - IBM Systems and Technology Group
Phone: (408) 927-2117 T/L 8-457-2117
E-mail: m...@us.ibm.com
Just out of curiosity, what is the advantage of using parquet without hadoop?
Sent from my iPhone
On 11 Aug, 2015, at 11:12 am, saif.a.ell...@wellsfargo.com wrote:
I confirm that it works,
I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450
Saif
From:
You need to configure the spark.sql.shuffle.partitions parameter to a different
value. It defaults to 200.
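For example (a sketch; sqlContext is the usual SQLContext):

sqlContext.setConf("spark.sql.shuffle.partitions", "16")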
On 8/11/15, 11:31 AM, Al M alasdair.mcbr...@gmail.com wrote:
I am using DataFrames with Spark 1.4.1. I really like DataFrames but the
partitioning makes no sense to me.
I am loading
See first section of http://spark.apache.org/community.html
On Tue, Aug 11, 2015 at 9:47 AM, Michel Robert m...@us.ibm.com wrote:
Michel Robert
Almaden Research Center
EDA - IBM Systems and Technology Group
Phone: (408) 927-2117 T/L 8-457-2117
E-mail: m...@us.ibm.com
I am running Spark 1.3 on CDH 5.4 stack. I am getting the following error
when I spark-submit my application:-
15/08/11 16:03:49 INFO Remoting: Starting remoting
15/08/11 16:03:49 INFO Remoting: Remoting started; listening on addresses
We use Talend, but not for Spark workflows.
Although it does have Spark components.
https://www.talend.com/download/talend-open-studio
It is free (commercial support available) and easy to design and deploy
workflows with.
Talend for Big Data 6.0 was released a month ago.
Is anybody using Talend for
On 10 Aug 2015, at 20:17, Akshat Aranya aara...@gmail.com wrote:
Hi Jerry, Akhil,
Thanks for your help. With s3n, the entire file is downloaded even while just
creating the RDD with sqlContext.read.parquet(). It seems like even just
opening and closing the
We are in the middle of figuring that out. At the high level, we want to
combine the best parts of existing workflow solutions.
On Fri, Aug 7, 2015 at 3:55 PM, Vikram Kone vikramk...@gmail.com wrote:
Hien,
Is Azkaban being phased out at linkedin as rumored? If so, what's linkedin
going to
Hi all,
silly question: does logging info messages, whether printed or to a file, or
event logging, cause any impact on the general performance of Spark?
Saif
Do you think it might be faster to put all the files in one directory but
still partitioned the same way? I don't actually need to filter on the
values of the partition keys, but I need to rely on there being no overlap in
the values of the keys between any two parquet files.
On Fri, Aug 7, 2015 at
Can you please mention the output for the following :
java -version
javac -version
Currently we have an IBM BigInsight cluster with 1 namenode + 1 JobTracker + 42
data/task nodes, which runs BigInsight V3.0.0.2, corresponding to Hadoop
2.2.0 with MR1.
Since IBM BigInsight doesn't come with Spark, we built Spark 1.2.2 with
Hadoop 2.2.0 + Hive 0.12 ourselves, and
What level of logging are you looking at ?
At INFO level, there shouldn't be a noticeable difference.
On Tue, Aug 11, 2015 at 12:24 PM, saif.a.ell...@wellsfargo.com wrote:
Hi all,
silly question. Does logging info messages, both print or to file, or
event logging, cause any impact to general
I see the following line in the log 15/08/11 17:59:12 ERROR
spark.SparkContext: Jar not found at
file:/home/ec2-user/./spark-streaming-test-0.0.1-SNAPSHOT-jar-with-dependencies.jar,
however I do see that this file exists on all the nodes at that path. Not
sure what's happening here. Please note I
If I have an RDD that happens to already be partitioned by a key, how
efficient can I expect a groupBy operation to be? I would expect that Spark
shouldn't have to move data around between nodes, and simply will have a
small amount of work just checking the partitions to discover that it
doesn't
Hi,
I'm new to Scala, Spark and PySpark and have a question about what approach
to take in the problem I'm trying to solve.
I have noticed that working with HBase tables read in using
`newAPIHadoopRDD` can be quite slow with large data sets when one is
interested in only a small subset of the
Philip,
If all data per key are inside just one partition, then Spark will figure that
out. Can you guarantee that’s the case?
What is it you are trying to achieve? There might be another way to do it
where you can be 100% sure of what's happening.
You can print debugString or explain (for DataFrame) to
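A small sketch of how to verify this (the pair RDD `pairs` and the partition count are placeholders): groupByKey reuses an existing partitioner, so no extra shuffle stage should show up in the debug string.

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

val pairs: RDD[(String, Int)] = ???
val partitioned = pairs.partitionBy(new HashPartitioner(8)).cache()
val grouped = partitioned.groupByKey() // same partitioner => no extra shuffle
println(grouped.toDebugString)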
You should create the key as a tuple type. In your case, RDD[((id, timeStamp),
value)] is the proper way to do it.
Kevin
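A hedged sketch of that composite-key approach using repartitionAndSortWithinPartitions (Spark 1.2+): partition by id only, so that each partition comes out sorted by (id, timeStamp). The types and partition count are assumptions:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

class IdPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (id, _) => math.abs(id.hashCode) % numPartitions
  }
}

val keyed: RDD[((String, Long), String)] = ??? // ((id, timeStamp), value)
val sorted = keyed.repartitionAndSortWithinPartitions(new IdPartitioner(8))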
--- Original Message ---
Sender : swethaswethakasire...@gmail.com
Date : 2015-08-12 09:37 (GMT+09:00)
Title : What is the optimal approach to do Secondary Sort in Spark?
Hi,
Can you share a query or stack trace? More information would make this
question easier to answer.
On Tue, Aug 11, 2015 at 8:50 PM, Ravisankar Mani rrav...@gmail.com wrote:
Hi all,
We got an exception like
“org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call
to
I also tend to agree that Azkaban is somewhat easier to get set up. Though I
haven't used the new UI for Oozie that is part of CDH, so perhaps that is
another good option.
It's a pity Azkaban is a little rough in terms of documenting its API, and the
scalability is an issue. However it
Hi Josh
Please ignore the stack trace in my last mail. Kindly refer to the exception details.
{org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call
to dataType on unresolved object, tree: 'Sheet1.Teams}
Regards,
Ravi
On Wed, Aug 12, 2015 at 1:34 AM, Ravisankar Mani
Hi Eric,
This is likely because you are putting the parameter after the primary
resource (latest_msmtdt_by_gridid_and_source.py), which makes it a
parameter to your application instead of a parameter to Spark.
-Sandy
On Wed, Aug 12, 2015 at 4:40 AM, Eric Bless eric.bl...@yahoo.com.invalid
Hi,
What is the optimal approach to do secondary sort in Spark? I have to first
sort by an id in the key and further sort by a timeStamp which is present
in the value.
Thanks,
Swetha
Hi Rosan,
Thanks for your response. Kindly refer to the following query and stack trace.
I have checked the same query in Hive, where it works perfectly. If I remove
the IN in the WHERE clause, it works in Spark too.
SELECT If(ISNOTNULL(SUM(`Sheet1`.`Runs`)),SUM(`Sheet1`.`Runs`),0) AS
`Sheet1_Runs`
Forgot to mention. Here is how I run the program :
./bin/spark-submit --conf spark.app.master=local[1]
~/workspace/spark-python/ApacheLogWebServerAnalysis.py
On Wednesday, 12 August 2015 10:28 AM, Spark Enthusiast
sparkenthusi...@yahoo.in wrote:
I wrote a small python program :
def
partitioning - by itself - is a property of an RDD. So essentially it is no
different in the case of streaming, where each batch is one RDD. You can use
partitionBy on the RDD and pass your custom partitioner function to it.
One thing you should consider is how balanced your partitions are, i.e. your
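A small sketch of that for streaming (keyedStream, the partitioner and the count are placeholders), applying the partitioner to each batch RDD via transform:

import org.apache.spark.HashPartitioner

val partitionedStream = keyedStream.transform { rdd =>
  rdd.partitionBy(new HashPartitioner(8))
}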
That's a good question; we don't support reading small files in a single
partition yet, but it's definitely an issue we need to optimize. Do you mind
creating a JIRA issue for this? Hopefully we can merge that in the 1.6 release.
200 is the default partition number for parallel tasks after the
Is the source of your dataframe partitioned on the key? As per your mail, it
looks like it is not. If that is the case, then for partitioning the data, you
will have to shuffle the data anyway.
Another part of your question is how to co-group data from two dataframes
based on a key. I think for RDDs
Thanks.
In my particular case, I am calculating a distinct count on a key that is
unique to each partition, so I want to calculate the distinct count within
each partition, and then sum those. This approach will avoid moving the
sets of that key around between nodes, which would be very
I imported the Spark project into IntelliJ and tried to run SparkPi in
IntelliJ, but it failed with a compilation error:
Error:scalac:
while compiling:
/Users/werere/github/spark/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala
during phase: jvm
library version:
Posting a comment from my previous mail post:
When data is received from a stream source, the receiver creates blocks of
data. A new block of data is generated every blockInterval milliseconds. N
blocks of data are created during the batchInterval, where N =
batchInterval/blockInterval. An RDD is
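(For example, with a 2-second batchInterval and the default 200 ms blockInterval, N = 2000/200 = 10 blocks, so each batch RDD would have 10 partitions.)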
Thanks for the info. When data is written to HDFS, how does Spark keep the
filenames written by multiple executors unique?
On Tue, Aug 11, 2015 at 9:35 PM, Hemant Bhanawat hemant9...@gmail.com
wrote:
Posting a comment from my previous mail post:
When data is received from a stream source,
I wrote a small python program:
def parseLogs(self):
    """Read and parse log file"""
    self._logger.debug("Parselogs() start")
    self.parsed_logs = (self._sc
                        .textFile(self._logFile)
                        .map(self._parseApacheLogLine)
                        .cache())
Hi
Trying to access GPU from a Spark 1.4.0 Docker slave, without much luck. In my
Spark program, I make a system call to a script, which performs various
calculations using GPU. I am able to run this script as standalone, or via
Mesos Marathon; however, calling the script through Spark fails
15/08/11 12:59:34 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
3.0 (TID 71, sdldalplhdw02.suddenlink.cequel3.com):
java.lang.NullPointerException
at
com.suddenlink.pnm.process.HBaseStoreHelper.flush(HBaseStoreHelper.java:313)
It's an error in your app: an NPE from HBaseStoreHelper.
Logs would be helpful to diagnose. Could you attach the logs ?
On Wed, Aug 12, 2015 at 5:19 AM, java8964 java8...@hotmail.com wrote:
The executor's memory is reset by --executor-memory 24G for spark-shell.
The one from spark-env.sh is just the default setting.
I can confirm from the
Hi all,
We got an exception like
“org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call
to dataType on unresolved object” when using some where condition queries.
I am using 1.4.0 version spark. Is this exception resolved in latest spark?
Regards,
Ravi
Definitely worth a try. And you can sort the records before writing out; then
you will get parquet files without overlapping keys.
Let us know if that helps.
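For instance, a minimal sketch (assuming a DataFrame df with the key in column "key"; the path is a placeholder). A global sort range-partitions the rows, so the resulting files should not have overlapping key ranges:

df.sort("key").write.parquet("/path/to/output")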
Hao
From: Philip Weaver [mailto:philip.wea...@gmail.com]
Sent: Wednesday, August 12, 2015 4:05 AM
To: Cheng Lian
Cc: user
Hi, I'm currently using a Pregel message-passing function for my graph in Spark
and GraphX. The problem I have is that the code runs perfectly on Spark 1.0
and finishes in a couple of minutes, but as we have upgraded, I'm now trying to
run the same code on 1.3 and it doesn't finish (left it overnight