ConnectionManager has been deprecated and is no longer used by default
(NettyBlockTransferService is the replacement). You should no longer see
these messages unless you have explicitly switched it back on.
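For reference, a hedged sketch of the setting involved (assuming the spark.shuffle.blockTransferService property available since Spark 1.2; "netty" is the default and "nio" re-enables the deprecated ConnectionManager path):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("example")
  .set("spark.shuffle.blockTransferService", "netty") // leave at the default; "nio" switches the old path back on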
On Tue, Aug 4, 2015 at 6:14 PM, Jim Green openkbi...@gmail.com wrote:
And also
Hi,
Lots of Spark Streaming's internal status is exposed through StreamingListener
(the same information you see in the web UI), so you could write your own
StreamingListener and register it with the StreamingContext to get at Spark
Streaming's internal information and write it to a CSV file.
You could check the source code
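For example, a minimal sketch of such a listener (assuming Spark's StreamingListener API; the CSV path and the fields written are only illustrative):

import java.io.FileWriter
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class CsvBatchListener(path: String) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    val writer = new FileWriter(path, true)  // append one line per completed batch
    try {
      writer.write(s"${info.batchTime.milliseconds},${info.schedulingDelay.getOrElse(-1L)},${info.processingDelay.getOrElse(-1L)}\n")
    } finally {
      writer.close()
    }
  }
}

// register it before ssc.start():
// ssc.addStreamingListener(new CsvBatchListener("/tmp/streaming-batches.csv"))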
using coalesce might be dangerous, since one worker process will need to
handle the whole file, and if the file is huge you'll get an OOM; however, it
depends on the implementation, I'm not sure how it will be done
nevertheless, it's worth trying the coalesce method (please post your results)
another option would be
Hi,
As the Spark job is only executed when you run the start() method of
JavaStreamingContext, all the jobs like map and flatMap are defined earlier.
But even though you put breakpoints in those functions, the breakpoints don't
stop there, so how can I debug the Spark jobs?
JavaDStream<String>
Hi,
I was recently looking into spark graphx as one of the frameworks that can
help me solve some graph related problems.
The 'think-like-a-vertex' paradigm is something new to me and I cannot wrap
my head around how to implement simple algorithms like Depth-First or
Breadth-First search, or even getting
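For what it's worth, a hedged sketch of breadth-first (hop-count) distances with GraphX's Pregel API; the tiny example graph and source vertex below are placeholders:

import org.apache.spark.graphx._

// toy graph: 1 -> 2 -> 3, 1 -> 3
val graph = Graph.fromEdgeTuples(sc.parallelize(Seq((1L, 2L), (2L, 3L), (1L, 3L))), defaultValue = 0)
val sourceId: VertexId = 1L

val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val bfs = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),          // vertex program: keep the shortest distance seen
  triplet =>                                               // send a message only if it improves the neighbour
    if (triplet.srcAttr + 1 < triplet.dstAttr) Iterator((triplet.dstId, triplet.srcAttr + 1))
    else Iterator.empty,
  (a, b) => math.min(a, b))                                // merge incoming messages

bfs.vertices.collect().foreach(println)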
Thanks Steve and Michael for your response.
Is there a tentative release date for Spark 1.5?
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Tuesday, August 4, 2015 11:53 PM
To: Steve Loughran ste...@hortonworks.com
Cc: Ishwardeep Singh ishwardeep.si...@impetus.co.in;
Hi everyone,
After a successful prediction, MulticlassMetrics returns an error about being
unable to find the key for the class when computing precision.
Is there a way to check whether the metrics contain the class key?
--
Hayri Volkan Agun
PhD. Student - Anadolu University
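One hedged way to guard against that error, assuming MLlib's MulticlassMetrics; the (prediction, label) pairs below are toy placeholders rather than real model output:

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// toy (prediction, label) pairs; real data would come from a trained model
val predictionAndLabels = sc.parallelize(Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)))
val metrics = new MulticlassMetrics(predictionAndLabels)

val classKey = 3.0
if (metrics.labels.contains(classKey)) {   // labels holds the class keys the metrics actually know about
  println(s"precision($classKey) = ${metrics.precision(classKey)}")
} else {
  println(s"class $classKey is not present in the evaluated data")
}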
seems that coalesce does work, see the following thread
https://www.mail-archive.com/user%40spark.apache.org/msg00928.html
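A minimal sketch of what that thread does (paths are placeholders): collapse to a single partition before writing so one output file is produced, at the cost of a single task handling all the data:

val data = sc.textFile("hdfs:///input/path")
data.coalesce(1).saveAsTextFile("hdfs:///output/single-file")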
On 5 August 2015 at 09:47, Igor Berman igor.ber...@gmail.com wrote:
using coalesce might be dangerous, since one worker process will need to
handle the whole file and if the file is
Hi
I've done Twitter streaming using Twitter's streaming user API and Spark
Streaming. This runs successfully on my local machine, but when I run this
program on the cluster in local mode, it only runs successfully the very
first time; later on it gives the following exception.
Exception in
No one helped me... I helped myself: I split the cluster into two clusters, 1.4.1
and 1.3.0
------------------ Original message ------------------
From: Ted Yu yuzhih...@gmail.com
Date: 2015-08-04 (Tue) 10:28
To: Igor Berman igor.ber...@gmail.com
Cc: Sea 261810...@qq.com; Barak
The results in MulticlassMetrics are totally wrong; they are improperly
calculated.
The confusion matrix may be correct, I don't know, but the per-label scores
are wrong.
--
Hayri Volkan Agun
PhD. Student - Anadolu University
Hi,
Thanks a lot for your reply.
It seems that it is because of the slowness of the second piece of code.
I rewrote the code as list(set([i.items for i in a] + [i.items for i in b])),
and the program returns to normal.
By the way, I find that when the computation is running, the UI shows
scheduler delay. However,
Hi,
What Spark version do you use? It looks like a problem of configuration
recovery; I'm not sure whether it is a Twitter-streaming-specific problem. I
tried Kafka streaming with checkpointing enabled on my local machine and saw
no such issue. Did you try to set these configurations somewhere?
Thanks
Saisai
Deepesh,
you have to call an action to start actual processing.
words.count() would do the trick.
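A hedged Scala sketch of that point (ssc and lines are assumed to already exist): a DStream pipeline only runs once an output operation is attached and the context is started:

val words = lines.flatMap(_.split(" "))
words.count().print()    // output operation; without one, start() has nothing to schedule

ssc.start()
ssc.awaitTermination()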
On 05 Aug 2015, at 11:42, Deepesh Maheshwari deepesh.maheshwar...@gmail.com
wrote:
Hi,
As spark job is executed when you run start() method of JavaStreamingContext.
All the job like map,
Hi, I have the following code, which fires hiveContext.sql() most of the time.
My task is to create a few tables and insert values into them after
processing, for all Hive table partitions. So I first fire 'show partitions' and,
using its output in a for loop, I call a few methods which create the table if not
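A hedged sketch of that pattern (table and column names below are placeholders, not the poster's actual schema):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val partitions = hiveContext.sql("SHOW PARTITIONS source_table").collect().map(_.getString(0))

partitions.foreach { spec =>                       // spec looks like "dt=2015-08-05"
  val Array(col, value) = spec.split("=", 2)
  hiveContext.sql("CREATE TABLE IF NOT EXISTS summary_table (k STRING, cnt BIGINT)")
  hiveContext.sql(s"INSERT INTO TABLE summary_table " +
    s"SELECT k, COUNT(*) FROM source_table WHERE $col = '$value' GROUP BY k")
}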
Hello,
I would love to have hive merge the small files in my managed hive context
after every query. Right now, I am setting the hive configuration in my
Spark job configuration but Hive is not managing the files. Do I need to
set the Hive settings in another place? How do you set Hive
Hi,
I'm receiving a memory allocation error with a recent build of Spark 1.5:
java.io.IOException: Unable to acquire 67108864 bytes of memory
at
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:348)
at
Thanks Akash for the answer. I added the endpoint to the listener and now it is
working.
On 5 Aug 2015, at 02:08, Ishwardeep Singh
ishwardeep.si...@impetus.co.in wrote:
Thanks Steve and Michael for your response.
Is there a tentative release date for Spark 1.5?
The branch has been cut and is going through its bug stabilisation/test phase.
This is where
On 4 Aug 2015, at 12:26, Sean Owen so...@cloudera.com wrote:
Oh good point, does the Windows integration need native libs for
POSIX-y file system access? I know there are some binaries shipped for
this purpose but wasn't sure if that's part of what's covered in the
native libs message.
On
Hi Hayri,
Can you provide a sample of the expected and actual results?
Feynman
On Wed, Aug 5, 2015 at 6:19 AM, Hayri Volkan Agun volkana...@gmail.com
wrote:
The results in MulticlassMetrics is totally wrong. They are improperly
calculated.
Confusion matrix may be true I don't know but for
Also, what version of Spark are you using?
On Wed, Aug 5, 2015 at 9:57 AM, Feynman Liang fli...@databricks.com wrote:
Hi Hayri,
Can you provide a sample of the expected and actual results?
Feynman
On Wed, Aug 5, 2015 at 6:19 AM, Hayri Volkan Agun volkana...@gmail.com
wrote:
The results
This feature isn't currently supported.
On Wed, Aug 5, 2015 at 8:43 AM, Brandon White bwwintheho...@gmail.com
wrote:
Hello,
I would love to have hive merge the small files in my managed hive context
after every query. Right now, I am setting the hive configuration in my
Spark Job
In Spark 1.5, we have a new way to manage memory (part of Project
Tungsten). The default unit of memory allocation is 64MB, which is way too
high when you have 1G of memory allocated in total and have more than 4
threads.
We will reduce the default page size before releasing 1.5. For now, you
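As a heavily hedged sketch only: in the 1.5 development builds of this period the page size could be overridden through a configuration property (assumed here to be spark.buffer.pageSize; the exact name may differ between snapshots), e.g. lowering it from the 64 MB default:

val conf = new org.apache.spark.SparkConf()
  .set("spark.buffer.pageSize", "16m")   // assumption: lowers the default Tungsten page size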
So there is no good way to merge Spark files in a managed Hive table right
now?
On Wed, Aug 5, 2015 at 10:02 AM, Michael Armbrust mich...@databricks.com
wrote:
This feature isn't currently supported.
On Wed, Aug 5, 2015 at 8:43 AM, Brandon White bwwintheho...@gmail.com
wrote:
Hello,
I
We tested Spark 1.2 and 1.3, and this issue is gone. I know that starting from
1.2, Spark uses Netty instead of NIO.
So you mean that bypasses this issue?
Another question is: why did this error message not show up in Spark 0.9 or
older versions?
On Tue, Aug 4, 2015 at 11:01 PM, Aaron Davidson
Hello, here's a simple program that demonstrates my problem:
ssc = StreamingContext(sc, 1)
input = [ [("k1", 12), ("k2", 14)], [("k1", 22)] ]
rawData = ssc.queueStream([sc.parallelize(d, 1) for d in input])
runningRawData = rawData.updateStateByKey(lambda nv, prev: reduce(sum, nv, prev or 0))
def
Hi,
Is it possible to start the Spark SQL Thrift server from within a streaming app
so the streamed data can be queried as it goes in?
Thank you.
Daniel
I was wondering when one should go for MLlib or SparkR. What are the criteria,
or what should be considered, before choosing either of the solutions for
data analysis?
Or what are the advantages of Spark MLlib over SparkR, or of SparkR over
MLlib?
SparkR doesn't support all the ML algorithms yet; the upcoming 1.5 will have more
support, but still not all the algorithms that are currently supported in MLlib.
SparkR is more of a convenience for R users to get acquainted with Spark at
this point.
-Ali
From: praveen S
A few points to consider:
a) SparkR gives the union of R on a single machine and the
distributed computing of Spark;
b) It also gives the ability to wrangle, in R, data that is in the
Spark ecosystem;
c) Coming to MLlib, the question is MLlib and R (not MLlib or R) -
depending on the scale,
I'm trying to load a 1 TB file whose lines i,j,v represent the values of a
matrix given as A_{ij} = v, so I can convert it to a Parquet file. Only some
of the rows of A are relevant, so the following code first loads the
triplets as text, splits them into Tuple3[Int, Int, Double], drops triplets
What machine learning algorithms are you interested in exploring or using?
Start from there or better yet the problem you are trying to solve, and
then the selection may be evident.
On Wednesday, August 5, 2015, praveen S mylogi...@gmail.com wrote:
I was wondering when one should go for MLib
Hello,
My Spark application is written in Scala and submitted to a Spark cluster
in standalone mode. The Spark Jobs for my application are listed in the
Spark UI like this:
Job Id Description ...
6 saveAsTextFile at Foo.scala:202
5 saveAsTextFile at Foo.scala:201
4
SparkContext#setJobDescription or SparkContext#setJobGroup
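A brief sketch of those two calls (the RDD, group id, and description strings are placeholders):

sc.setJobGroup("nightly-export", "Export cleaned events")   // groups the jobs that follow under one id
sc.setJobDescription("write cleaned events to HDFS")        // shown in the UI instead of "saveAsTextFile at Foo.scala:202"
rdd.saveAsTextFile("hdfs:///tmp/cleaned-events")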
On Wed, Aug 5, 2015 at 12:29 PM, Rares Vernica rvern...@gmail.com wrote:
Hello,
My Spark application is written in Scala and submitted to a Spark cluster
in standalone mode. The Spark Jobs for my application are listed in the
Spark
How big is droprows?
Try explicitly broadcasting it like this:
val broadcastDropRows = sc.broadcast(dropRows)
val valsrows = ...
.filter(x => !broadcastDropRows.value.contains(x._1))
- Philip
On Wed, Aug 5, 2015 at 11:54 AM, AlexG swift...@gmail.com wrote:
I'm trying to load a 1 Tb file
Hello,
I am trying to define an external Hive table from Spark HiveContext like the
following:
import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)
hiveCtx.sql(s"CREATE EXTERNAL TABLE IF NOT EXISTS Rentrak_Ratings (Version
string, Gen_Date string, Market_Number
1.5 has not yet been released; what is the commit hash that you are
building?
On Wed, Aug 5, 2015 at 10:29 AM, Hayri Volkan Agun volkana...@gmail.com
wrote:
Hi,
In Spark 1.5 I saw a result for precision 1.0 and recall 0.01 for decision
tree classification.
While precision is a hundred percent, the
I've been using SparkConf on my project for quite some time now to store
configuration information for its various components. This has worked very
well thus far in situations where I have control over the creation of the
SparkContext and the SparkConf.
I have run into a bit of a problem trying to
It seems you want to dedupe your data after the merge, so set(a + b) should
also work... you may ditch the list comprehension operation.
On 5 Aug 2015 23:55, gen tang gen.tan...@gmail.com wrote:
Hi,
Thanks a lot for your reply.
It seems that it is because of the slowness of the second code.
I
I have csv data that is embedded in gzip format on HDFS.
*With Pig*
a = load
'/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-3.gz' using
PigStorage();
b = limit a 10
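For comparison, a hedged Spark equivalent of the Pig snippet above: sc.textFile decompresses .gz files transparently (each gzip file becomes a single, non-splittable partition); the path is the one from the Pig example:

val a = sc.textFile("/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-3.gz")
a.take(10).foreach(println)   // rough analogue of "limit a 10"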
This message means that java.util.Date is not supported by Spark DataFrame.
You'll need to use java.sql.Date, I believe.
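A hedged sketch of that fix, modelled on the formatStringAsDate helper quoted later in this thread: parse with SimpleDateFormat but return java.sql.Date so the value fits a DataFrame schema:

import java.text.SimpleDateFormat

def formatStringAsDate(dateStr: String): java.sql.Date =
  new java.sql.Date(new SimpleDateFormat("yyyy-MM-dd").parse(dateStr).getTime)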
On Wed, Aug 5, 2015 at 8:29 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
That seem to be working. however i see a new exception
Code:
def formatStringAsDate(dateStr:
Hi,
My Spark setup is a cluster on top of Mesos using Docker containers.
I want to pull the docker images from a private repository (currently gcr.io
),
and I can't get the authentication to work.
I know how to generate a .dockercfg file (running on GCE, using gcloud
docker -a).
My problem is
Thanks Tathagata. I tried that but BlockGenerator internally uses
SystemClock which is again private.
We are using DSE so we're stuck with Spark 1.2, hence can't use the receiver-less
version. Is it possible to use the same code as a separate API with 1.2?
Thanks,
Sourabh
On Wed, Aug 5, 2015 at 6:13
Thanks!
On Wed, Aug 5, 2015 at 5:24 PM, Saisai Shao sai.sai.s...@gmail.com wrote:
Yes, finally shuffle data will be written to disk for reduce stage to
pull, no matter how large you set to shuffle memory fraction.
Thanks
Saisai
On Thu, Aug 6, 2015 at 7:50 AM, Muler
Please see the comments at the tail of SPARK-2356
Cheers
On Wed, Aug 5, 2015 at 6:04 PM, Ashish Dutt ashish.du...@gmail.com wrote:
*Use Case:* To automate the process of data extraction (HDFS), data
analysis (pySpark/sparkR) and saving the data back to HDFS
programmatically.
*Prospective
Have you tried reading the spark documentation?
http://spark.apache.org/docs/latest/programming-guide.html
Thank you,
Ilya Ganelin
-Original Message-
From: ÐΞ€ρ@Ҝ (๏̯͡๏) [deepuj...@gmail.com]
Sent: Thursday, August 06, 2015 12:41 AM Eastern Standard Time
Absolutely, thanks!
On Wed, Aug 5, 2015 at 9:07 PM, Cheng Lian lian.cs@gmail.com wrote:
We've fixed this issue in 1.5 https://github.com/apache/spark/pull/7396
Could you give it a shot to see whether it helps in your case? We've
observed ~50x performance boost with schema merging turned
We've fixed this issue in 1.5 https://github.com/apache/spark/pull/7396
Could you give it a shot to see whether it helps in your case? We've
observed ~50x performance boost with schema merging turned on.
Cheng
On 8/6/15 8:26 AM, Philip Weaver wrote:
I have a parquet directory that was
Hi,
You can try this Kafka consumer for Spark, which is also part of Spark
Packages: https://github.com/dibbhatt/kafka-spark-consumer
Regards,
Dibyendu
On Thu, Aug 6, 2015 at 6:48 AM, Sourabh Chandak sourabh3...@gmail.com
wrote:
Thanks Tathagata. I tried that but BlockGenerator internally
Hi
For checkpointing and using the fromOffsets argument - say for the first time
when my app starts I don't have any previous state stored and I want to start
consuming from the largest offset:
1. Is it possible to specify that in the fromOffsets API? I don't want to use
another API which returns
You could very easily strip out the BlockGenerator code from the Spark
source code and use it directly in the same way the Reliable Kafka Receiver
uses it. BTW, you should know that we will be deprecating the receiver
based approach for the Direct Kafka approach. That is quite flexible, can
give
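For reference, a hedged sketch of the direct approach (available since Spark 1.3); the broker addresses and topic name are placeholders and ssc is assumed to already exist:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092,broker2:9092",
  "auto.offset.reset" -> "largest")   // start from the largest offset when no checkpoint exists
val topics = Set("events")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)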
The parallelize method does not read the contents of a file. It simply
takes a collection and distributes it to the cluster. In this case, the
String is a collection of 67 characters.
Use sc.textFile instead of sc.parallelize, and it should work as you want.
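A one-line illustration of the difference (the path is a placeholder):

val lines = sc.textFile("/path/to/file.csv")   // RDD[String], one element per line of the file
// whereas parallelize would just distribute the characters of the path string itself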
On Wed, Aug 5, 2015 at 8:12 PM, ÐΞ€ρ@Ҝ
That seem to be working. however i see a new exception
Code:
def formatStringAsDate(dateStr: String) = new
SimpleDateFormat("yyyy-MM-dd").parse(dateStr)
//(2015-07-27,12459,,31242,6,Daily,-999,2099-01-01,2099-01-02,1,0,0.1,0,1,-1,isGeo,,,204,694.0,1.9236856708701322E-4,0.0,-4.48,0.0,0.0,0.0,)
val
how do i persist the RDD to HDFS ?
On Wed, Aug 5, 2015 at 8:32 PM, Philip Weaver philip.wea...@gmail.com
wrote:
This message means that java.util.Date is not supported by Spark
DataFrame. You'll need to use java.sql.Date, I believe.
On Wed, Aug 5, 2015 at 8:29 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)
Code:
val summary = rowStructText.map(s => s.split(",")).map(
{
s =>
Summary(formatStringAsDate(s(0)),
s(1).replaceAll("\"", "").toLong,
s(3).replaceAll("\"", "").toLong,
s(4).replaceAll("\"", "").toInt,
s(5).replaceAll("\"", ""),
Hi,
Consider I'm running WordCount with 100m of data on a 4-node cluster.
Assuming my RAM size on each node is 200g and I'm giving my executors 100g
(just enough memory for the 100m of data):
1. If I have enough memory, can Spark 100% avoid writing to disk?
2. During shuffle, where results have to
Hi Muler,
Shuffle data will be written to disk no matter how much memory you have;
large memory can alleviate shuffle spill, where temporary files are
generated if memory is not enough.
Yes, each node writes shuffle data to file, and it is pulled from disk in the
reduce stage through the network framework
Hi, I have a question about sampling Spark Streaming data, or getting part of
the data. For every minute, I only want the data read in during the first 10
seconds, and discard all data in the next 50 seconds. Is there any way to
pause reading and discard data in that period? I'm doing this to
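One hedged way to approximate that (stream is assumed to be an existing DStream; alignment to minute boundaries is only approximate): keep the stream running but drop every batch whose time falls outside the first 10 seconds of the minute:

val sampled = stream.transform { (rdd, time) =>
  if ((time.milliseconds / 1000) % 60 < 10) rdd
  else rdd.filter(_ => false)   // discard batches from the other 50 seconds
}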
thanks, so if I have large enough memory (with enough spark.shuffle.memory)
then shuffle spill (in-memory shuffle) doesn't happen (per node), but shuffle
data still has to be ultimately written to disk so that the reduce stage
pulls it across the network?
On Wed, Aug 5, 2015 at 4:40 PM, Saisai Shao
Hi to all,
I'm trying to use some window functions (ntile and percentRank) on a
DataFrame, but I don't know how to use them.
Can anyone help me with this, please? In the Python API documentation
there are no examples about it.
Specifically, I'm trying to get quantiles of a numeric field in my
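A hedged Scala sketch of those two functions (df and the column name "score" are placeholders; in later Spark versions percentRank is spelled percent_rank):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{ntile, percentRank}

val w = Window.orderBy("score")
val withQuantiles = df.select(
  df("score"),
  ntile(4).over(w).as("quartile"),        // 1..4, i.e. which quartile the row falls in
  percentRank().over(w).as("pct_rank"))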
Hi Danniel,
It is possible to create an instance of the Spark SQL Thrift server; however,
it seems like this project is what you may be looking for:
https://github.com/Intel-bigdata/spark-streamingsql
I'm not 100% sure what your use case is, but you can always convert the data into
a DF and then issue a query
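A hedged sketch of the first option (requires the spark-hive-thriftserver module; table names are placeholders): start the Thrift server against an existing HiveContext so registered temp tables become queryable over JDBC:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)
HiveThriftServer2.startWithContext(hiveContext)
// inside the streaming job, e.g. in foreachRDD: rdd.toDF().registerTempTable("streamed_data")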
Hi DB Tsai-2,
I am trying to run a singleton SparkContext in my container (a Spring Boot
Tomcat container). When my application bootstraps, I create the
SparkContext and keep the reference for future job submission. I got it
working with standalone Spark perfectly, but I am having trouble with YARN
Yes, finally shuffle data will be written to disk for the reduce stage to pull,
no matter how large you set the shuffle memory fraction.
Thanks
Saisai
On Thu, Aug 6, 2015 at 7:50 AM, Muler mulugeta.abe...@gmail.com wrote:
thanks, so if I have enough large memory (with enough
spark.shuffle.memory)
I have a parquet directory that was produced by partitioning by two keys,
e.g. like this:
df.write.partitionBy("a", "b").parquet("asdf")
There are 35 values of a, and about 1100-1200 values of b for each
value of a, for a total of over 40,000 partitions.
Before running any transformations or actions
Hi Dimitris,
Thanks for your reply. Just wondering – are you asking about my streaming input
source? I implemented a custom receiver and have been using that. Thanks.
From: Dimitris Kouzis - Loukas look...@gmail.com
Date: Wednesday, August 5, 2015 at 5:27 PM
To: Heath
Use Case: I want to use my laptop (using Win 7 Professional) to connect to
the CentOS 6.4 master server using PyCharm.
Objective: To write the code in Pycharm on the laptop and then send the job
to the server which will do the processing and should then return the
result back to the laptop or to
Hi,
I am trying to replicate the Kafka Streaming Receiver for a custom version
of Kafka and want to create a Reliable receiver. The current implementation
uses BlockGenerator which is a private class inside Spark streaming hence I
can't use that in my code. Can someone help me with some resources
*Use Case:* To automate the process of data extraction (HDFS), data
analysis (pySpark/sparkR) and saving the data back to HDFS
programmatically.
*Prospective solutions:*
1. Create a remote server connectivity program in an IDE like pyCharm or
RStudio and use it to retrieve the data from HDFS or