Hi,
Any comments please.
Regards, Laeeq
On Friday, April 17, 2015 11:37 AM, Laeeq Ahmed
laeeqsp...@yahoo.com.INVALID wrote:
Hi,
I am working with multiple Kafka streams (23 streams) and currently I am
processing them separately. I receive one stream from each topic. I have the
I'm trying to find out how to set up a resilient Spark cluster.
Things I'm thinking about include:
- How to start multiple masters on different hosts?
- there isn't a conf/masters file from what I can see
Thank you.
Hi,
My data looks like this:
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| cust_id   | string     |          |
| part_num  | int        |
It looks like you’re creating 23 actions in your job (one per DStream). As
far as I know by default Spark Streaming executes only one job at a time.
So your 23 actions are executed one after the other. Try setting
spark.streaming.concurrentJobs to something higher than one.
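For illustration, a minimal sketch of how that setting could be applied (the value 4 and the app name are just placeholders, not a recommendation):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Allow several streaming output jobs to run concurrently instead of one at a time.
val conf = new SparkConf()
  .setAppName("MultiTopicStreaming")
  .set("spark.streaming.concurrentJobs", "4")
val ssc = new StreamingContext(conf, Seconds(10))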
iulian
On Fri, Apr
S3a isn't ready for production use on anything below Hadoop 2.7.0. I say that
as the person who mentored in all the patches for it between Hadoop 2.6 and 2.7.
you need everything in https://issues.apache.org/jira/browse/HADOOP-11571 in
your code
-Hadoop 2.6.0 doesn't have any of the HADOOP-11571
# of tasks = # of partitions, hence you can provide the desired number of
partitions to the textFile API, which should result in a) a better spatial
distribution of the RDD and b) each partition being operated upon by a separate
task.
You can provide the number of p
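As a rough sketch (the path and the count of 48 are hypothetical), the second argument to textFile requests a minimum number of partitions:

// Ask for more partitions up front so each task operates on a smaller slice.
val rdd = sc.textFile("hdfs:///data/input.txt", 48)
println(rdd.partitions.length)  // at least 48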
-Original Message-
You can probably try the Low Level Consumer from spark-packages (
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer).
How many partitions are there for your topics? Let's say you have 10 topics,
each having 3 partitions; ideally you can create at most 30 parallel
Receivers and 30
You can resort to Serialized storage (still in memory) of your RDDs - this
will obviate the need for GC since the RDD elements are stored as serialized
objects off the JVM heap (most likely in Tachyon, which is a distributed
in-memory file system used by Spark internally).
Also review the Object
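A minimal sketch of the serialized-caching suggestion, assuming rdd is an already-built RDD (MEMORY_ONLY_SER keeps serialized bytes in memory; OFF_HEAP is the Tachyon-backed level):

import org.apache.spark.storage.StorageLevel

// Store the RDD as serialized bytes rather than deserialized Java objects,
// which greatly reduces the number of objects the GC has to track.
val cached = rdd.persist(StorageLevel.MEMORY_ONLY_SER)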
Hi Andrew, according to you, we should balance the time when GC runs against
the batch time in which the RDD is processed?
On Fri, Apr 24, 2015 at 6:58 AM Reza Zadeh r...@databricks.com wrote:
Hi Andrew,
The .principalComponents feature of RowMatrix is currently constrained to
tall and skinny matrices.
On Thu, Apr 23, 2015 at 6:09 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I have seen multiple blogs stating to use reduceByKey instead of
groupByKey. Could someone please help me convert the code below to use
reduceByKey?
Code
some spark processing
...
Below
val
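Since the original code is elided above, here is only the general pattern, with a hypothetical pairs: RDD[(String, Int)] standing in for the real data:

// Hypothetical input pairs, e.g. (key, count)
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey shuffles every value across the network before summing...
val viaGroup  = pairs.groupByKey().mapValues(_.sum)

// ...whereas reduceByKey combines values map-side first, then shuffles only partial sums.
val viaReduce = pairs.reduceByKey(_ + _)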
If I run Spark in stand-alone mode (not YARN mode), is there any tool like
Sqoop that is able to transfer data from an RDBMS to Spark storage?
Thanks
What is the specific use case? I can think of a couple of ways (write to HDFS
and then read from Spark, or stream data to Spark). Also I have seen people
using MySQL jars to bring data in. Essentially you want to simulate the
creation of an RDD.
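As one sketch of the "mysql jars" route, assuming Spark 1.3's JDBC data source and placeholder connection details:

// Load a table from MySQL as a DataFrame; url and dbtable values are placeholders.
val jdbcDF = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:mysql://dbhost:3306/mydb?user=u&password=p",
  "dbtable" -> "customers"))
jdbcDF.registerTempTable("customers")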
On 24 Apr 2015 18:15, sequoiadb mailing-list-r...@sequoiadb.com
I use sudo pip install ... on each machine in the cluster, and don't think
about how to submit the library.
On Fri, Apr 24, 2015 at 4:21 AM dusts66 dustin.davids...@gmail.com wrote:
I am trying to figure out Python library management. So my question is:
Where do third party Python libraries (e.g. numpy, scipy,
Yes Akhil. This is the same issue. I have updated my comment in that ticket.
Thanks
Sourabh
On Fri, Apr 24, 2015 at 12:02 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Isn't this related to this
https://issues.apache.org/jira/browse/SPARK-6681
Thanks
Best Regards
On Fri, Apr 24, 2015
Hello!
I know that HadoopRDD partitions are built based on the number of splits
in HDFS. I'm wondering if these partitions preserve the initial order of
data in file.
As an example, if I have an HDFS (myTextFile) file that has these splits:
split 0 - line 1, ..., line k
split 1 - line k+1, ...,
This is not about the GC issue itself. The memory is
On Friday, April 24, 2015, Evo Eftimov evo.efti...@isecc.com wrote:
You can resort to Serialized storage (still in memory) of your RDDs – this
will obviate the need for GC since the RDD elements are stored as
serialized objects off the JVM heap
zipWithIndex will preserve whatever order is already in your val lines.
I am not sure whether the line val lines = sc.textFile("hdfs://mytextFile")
maintains the order itself, but the next step will maintain it for sure.
On 24 April 2015 at 18:35, Spico Florin spicoflo...@gmail.com wrote:
Hello!
I know that
I did a quick test as I was curious about it too. I created a file with
numbers from 0 to 999, in order, line by line. Then I did:
scala> val numbers = sc.textFile("./numbers.txt")
scala> val zipped = numbers.zipWithUniqueId
scala> zipped.foreach(i => println(i))
Expected result if the order was
I have tested the sortByKey method with the following code and I have observed
that it triggers a new job when it is called. I could not find this documented
either in the API or in the code. Is this an intended behavior? For example,
the RDD zipWithIndex method's API doc specifies that it will trigger a new job. But
Yes, I think this is a known issue, that sortByKey actually runs a job
to assess the distribution of the data.
https://issues.apache.org/jira/browse/SPARK-1021 I think further eyes
on it would be welcome as it's not desirable.
On Fri, Apr 24, 2015 at 9:57 AM, Spico Florin spicoflo...@gmail.com
Yin:
Fix Version of SPARK-4520 is not set.
I assume it was fixed in 1.3.0
Cheers
Fix Version
On Fri, Apr 24, 2015 at 11:00 AM, Yin Huai yh...@databricks.com wrote:
The exception looks like the one mentioned in
https://issues.apache.org/jira/browse/SPARK-4520. What is the version of
Spark?
Sorry for my explanation, my English is bad. I just need to obtain the Long
contained in the DStream created by messages.count(). Thanks to all.
2015-04-24 20:00 GMT+02:00 Sean Owen so...@cloudera.com:
Do you mean an RDD? I don't think it makes sense to ask if the DStream
has data; it may have
oh, I missed that. It is fixed in 1.3.0.
Also, Jianshi, the dataset was not generated by Spark SQL, right?
On Fri, Apr 24, 2015 at 11:09 AM, Ted Yu yuzhih...@gmail.com wrote:
Yin:
Fix Version of SPARK-4520 is not set.
I assume it was fixed in 1.3.0
Cheers
Fix Version
On Fri, Apr 24,
But if I use messages.count().print, this shows a single number :/
2015-04-24 20:22 GMT+02:00 Sean Owen so...@cloudera.com:
It's not a Long; it's an infinite stream of Longs.
On Fri, Apr 24, 2015 at 2:20 PM, Sergio Jiménez Barrio
drarse.a...@gmail.com wrote:
It isn't the sum. This is de
It's mostly manual. You could try automating with something like Chef, of
course, but there's nothing already available in terms of automation.
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
On top of what's been said...
On Wed, Apr 22, 2015 at 10:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
1) I can go to Spark UI and see the status of the APP but cannot see the
logs as the job progresses. How can I see logs of executors as they progress?
Spark 1.3 should have links to the
The sum? You just need to use an accumulator to sum the counts, or something.
On Fri, Apr 24, 2015 at 2:14 PM, Sergio Jiménez Barrio
drarse.a...@gmail.com wrote:
Sorry for my explanation, my English is bad. I just need to obtain the Long
contained in the DStream created by messages.count().
For #1, click on a worker node on the YARN dashboard. From there,
Tools > Local logs > Userlogs has the logs for each application, and you can
view them by executor even while an application is running. (This is for
Hadoop 2.4, things may have changed in 2.6.)
-Sven
On Thu, Apr 23, 2015 at 6:27 AM,
No, it prints each Long in that stream, forever. Have a look at the DStream API.
On Fri, Apr 24, 2015 at 2:24 PM, Sergio Jiménez Barrio
drarse.a...@gmail.com wrote:
But if I use messages.count().print, this shows a single number :/
On Fri, Apr 24, 2015 at 11:31 AM, Marcelo Vanzin van...@cloudera.com
wrote:
Spark 1.3 should have links to the executor logs in the UI while the
application is running. Not yet in the history server, though.
You're absolutely correct -- didn't notice it until now. This is a great
addition!
How can one disable Partition discovery in Spark 1.3.0 when using
sqlContext.parquetFile?
Alternatively, is there a way to load .parquet files without Partition
discovery?
-
https://www.linkedin.com/in/cosmincatalinsanda
The order of elements in an RDD is in general not guaranteed unless
you sort. You shouldn't expect to encounter the partitions of an RDD
in any particular order.
In practice, you probably find the partitions come up in the order
Hadoop presents them in this case. And within a partition, in this
If you're reading a file line by line then you should simply use Hadoop's
FileSystem class from Java to read the file with a BufferedInputStream. I don't
think you need an RDD here.
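A minimal sketch of that approach (the path is a placeholder); it reads the file sequentially on one JVM, so the original line order is preserved:

import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val in = new BufferedReader(new InputStreamReader(
  fs.open(new Path("hdfs:///data/updates.log"))))
try {
  var line = in.readLine()
  while (line != null) {
    // apply the update encoded by this line, strictly in file order
    line = in.readLine()
  }
} finally {
  in.close()
}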
Sent with Good (www.good.com)
-Original Message-
From: Michal Michalski
Hi Sergio,
I missed this thread somehow... For the error "case classes cannot have
more than 22 parameters.", it is a limitation of Scala (see
https://issues.scala-lang.org/browse/SI-7296). You can follow the
instruction at
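One common workaround, sketched here with just two hypothetical fields, is to define the schema programmatically with StructType instead of a case class, so the 22-parameter limit never applies:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Build the schema by hand; extend the Seq with as many fields as you need.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)))
val rowRDD = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
val df = sqlContext.createDataFrame(rowRDD, schema)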
Thanks that's why I was worried and tested my application again :).
On 24 April 2015 at 23:22, Michal Michalski michal.michal...@boxever.com
wrote:
Yes.
Kind regards,
Michał Michalski,
michal.michal...@boxever.com
On 24 April 2015 at 17:12, Jeetendra Gangele gangele...@gmail.com wrote:
Can anyone guide me on how to reduce the size from Long to Int, since I
don't need a Long index?
I have huge data and this index takes 8 bytes; if I can reduce it to 4
bytes it will be a great help.
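A minimal sketch, assuming every index fits in an Int (i.e. fewer than ~2.1 billion elements); rdd stands for the original RDD:

// Truncate the Long index produced by zipWithIndex down to an Int.
val indexedInt = rdd.zipWithIndex().map { case (value, idx) => (value, idx.toInt) }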
On 22 April 2015 at 22:46, Jeetendra Gangele gangele...@gmail.com wrote:
Sure thanks. if you can guide
Hi,
I selected a starter task in JIRA, and made changes to my github fork of
the current code.
I assumed I would be able to build and test.
% mvn clean compile was fine
but
% mvn package failed
[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-surefire-plugin:2.18:test (default-test)
Hi Reza,
I’m trying to identify groups of similar variables, with the ultimate goal of
reducing the dimensionality of the dataset. I believe SVD would be sufficient
for this, although I also tried running RowMatrix.computeSVD and observed the
same behavior: frequent task failures, with
Hi Burak,
Thanks for this insight. I’m curious to know, how did you reach the conclusion
that GC pauses were to blame? I’d like to gather some more diagnostic
information to determine whether or not I’m facing a similar scenario.
~ Andrew
From: Burak Yavuz [mailto:brk...@gmail.com]
Sent:
Any ideas on this? Any sample code to join 2 data frames on two columns?
Thanks
Ali
On Apr 23, 2015, at 1:05 PM, Ali Bajwa ali.ba...@gmail.com wrote:
Hi experts,
Sorry if this is a n00b question or has already been answered...
Am trying to use the data frames API in python to join 2
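For what it's worth, a sketch of a two-column equi-join, written in Scala for illustration (the Python DataFrame API mirrors it); df1, df2 and the column names are hypothetical:

// Join on both columns at once by AND-ing the two equality expressions.
val joined = df1.join(df2,
  df1("customer_id") === df2("customer_id") &&
  df1("order_date")  === df2("order_date"))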
Using Object[] in Java just works :).
On Fri, Apr 24, 2015 at 4:56 PM, Wenlei Xie wenlei@gmail.com wrote:
Hi,
I am wondering if there is any way to create a Row in SparkSQL 1.2 in Java
by using a List? It looks like
ArrayList<Object> something;
Row.create(something)
will create a row
The standard incantation -- which is a little different from standard
Maven practice -- is:
mvn -DskipTests [your options] clean package
mvn [your options] test
Some tests require the assembly, so you have to do it this way.
I don't know what the test failures were, you didn't post them, but
Hi,
I may need to read many values. The list [0,4,5,6,8] gives the locations of the
rows I'd like to extract from the RDD (of LabeledPoints). Could you possibly
provide a quick example?
Also, I'm not quite sure how this works, but the resulting RDD should be a
clone, as I may need to modify the
Try putting files with different file names and see if the stream is able to
detect them.
On 25-Apr-2015 3:02 am, Yang Lei [via Apache Spark User List]
ml-node+s1001560n22650...@n3.nabble.com wrote:
I hit the same issue as if the directory has no files at all when
running the sample
To run MLlib, you only need numpy on each node. For additional
dependencies, you can call the spark-submit with --py-files option and
add the .zip or .egg.
https://spark.apache.org/docs/latest/submitting-applications.html
Cheers,
Christian
On Fri, Apr 24, 2015 at 1:56 AM, Hoai-Thu Vuong
Does anyone know in which version of Spark there will be support for
ORC files via spark.sql.hive? Will it be in 1.4?
David
I'm deploying a Spark data processing job on an EC2 cluster, the job is small
for the cluster (16 cores with 120G RAM in total), the largest RDD has only
76k+ rows. But it is heavily skewed in the middle (thus requiring repartitioning)
and each row has around 100k of data after serialization. The job
The problem I'm facing is that I need to process lines from input file in
the order they're stored in the file, as they define the order of updates I
need to apply on some data and these updates are not commutative so that
order matters. Unfortunately the input is purely order-based, there's no
Hi, I have the following problem. I have a dataset with 30 columns (15 numeric,
15 categorical) and I am using ml transformers/estimators to transform each
column (StringIndexer for categorical, MeanImputor for numeric). This
creates 30 more columns in the dataframe. After that I'm using VectorAssembler
to
Thanks Dean,
Sure I have that setup locally and testing it with ZK.
But to start my multiple Masters, do I need to go to each host and start one
there, or is there a better way to do this?
Regards
jk
On Fri, Apr 24, 2015 at 5:23 PM, Dean Wampler deanwamp...@gmail.com wrote:
The convention for
you used ZipWithUniqueID?
On 24 April 2015 at 21:28, Michal Michalski michal.michal...@boxever.com
wrote:
I somehow missed zipWithIndex (and Sean's email), thanks for hint. I mean
- I saw it before, but I just thought it's not doing what I want. I've
re-read the description now and it looks
Another issue is that HadoopRDD (which sc.textFile uses) might split input
files, and even if it doesn't split, it doesn't guarantee that part-file
numbers go to the corresponding partition number in the RDD. E.g. part-0
could go to partition 27.
On Apr 24, 2015 7:41 AM, Michal Michalski
The convention for standalone cluster is to use Zookeeper to manage master
failover.
http://spark.apache.org/docs/latest/spark-standalone.html
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
I somehow missed zipWithIndex (and Sean's email), thanks for hint. I mean -
I saw it before, but I just thought it's not doing what I want. I've
re-read the description now and it looks like it might be actually what I
need. Thanks.
Kind regards,
Michał Michalski,
michal.michal...@boxever.com
On
I'd prefer to avoid preparing the file in advance by adding ordinals
before / after each line
I mean - I want to avoid doing it outside of spark of course. That's why I
want to achieve the same effect with Spark by reading the file as a single
partition and zipping it with unique id which - I hope
I have an RDD<Object> which I get from an HBase scan using newAPIHadoopRDD. I
am running zipWithIndex here and it is preserving the order: the first object got
1, the second got 2, the third got 3, and so on; the nth object got n.
On 24 April 2015 at 20:56, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:
To maintain
Of course after you do it, you probably want to call repartition(somevalue)
on your RDD to get your parallelism back.
Kind regards,
Michał Michalski,
michal.michal...@boxever.com
On 24 April 2015 at 15:28, Michal Michalski michal.michal...@boxever.com
wrote:
I did a quick test as I was curious
Michael - you need to sort your RDD. Check out the shuffle documentation on the
Spark Programming Guide. It talks about this specifically. You can resolve this
in a couple of ways - either by collecting your RDD and sorting it, using
sortBy, or not worrying about the internal ordering. You can
Please see SPARK-2883
There is no Fix Version yet.
On Fri, Apr 24, 2015 at 5:45 PM, David Mitchell jdavidmitch...@gmail.com
wrote:
Does anyone know in which version of Spark there will be support for
ORC files via spark.sql.hive? Will it be in 1.4?
David
Can you give an example set of data and the desired output?
On Sat, Apr 25, 2015 at 2:32 PM, Wenlei Xie wenlei@gmail.com wrote:
Hi,
I would like to answer the following customized aggregation query on Spark
SQL
1. Group the table by the value of Name
2. For each group, choose the tuple with
Hi,
I checked the number of partitions by
System.out.println("INFO: RDD with " + rdd.partitions().size() + " partitions created.");
Each single split is about 100MB. I am currently loading the data from the
local file system; would this explain this observation?
Thank you!
Best,
Wenlei
On Tue, Apr
I just tested your pr
On 25 Apr 2015 10:18, Ali Bajwa ali.ba...@gmail.com wrote:
Any ideas on this? Any sample code to join 2 data frames on two columns?
Thanks
Ali
On Apr 23, 2015, at 1:05 PM, Ali Bajwa ali.ba...@gmail.com wrote:
Hi experts,
Sorry if this is a n00b question or has
Setting SPARK_CLASSPATH is triggering other errors. Not working.
On 25 April 2015 at 09:16, Manku Timma manku.tim...@gmail.com wrote:
Actually found the culprit. The JavaSerializerInstance.deserialize is
called with a classloader (of type MutableURLClassLoader) which has access
to all the
Here you go:
t = [["A",10,"A10"],["A",20,"A20"],["A",30,"A30"],["B",15,"B15"],["C",10,"C10"],["C",20,"C200"]]
TRDD = sc.parallelize(t).map(lambda t:
Row(name=str(t[0]), age=int(t[1]), other=str(t[2])))
TDF = ssc.createDataFrame(TRDD)
print TDF.printSchema()
TDF.registerTempTable("tab")
JN =
Actually found the culprit. The JavaSerializerInstance.deserialize is
called with a classloader (of type MutableURLClassLoader) which has access
to all the hive classes. But internally it triggers a call to loadClass but
with the default classloader. Below is the stacktrace (line numbers in the
Hi,
I would like to answer the following customized aggregation query on Spark
SQL
1. Group the table by the value of Name
2. For each group, choose the tuple with the max value of Age (the ages are
distinct for every name)
I am wondering what's the best way to do it on Spark SQL? Should I use
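One way to express it, as a sketch (assuming the table is registered as "people" with columns Name and Age): compute the max age per name, then join back to recover the full tuple.

val result = sqlContext.sql("""
  SELECT p.*
  FROM people p
  JOIN (SELECT Name, MAX(Age) AS maxAge FROM people GROUP BY Name) m
    ON p.Name = m.Name AND p.Age = m.maxAge
""")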
Solved! I have solved the problem by combining both solutions. The result is
this:
messages.foreachRDD { rdd =>
  val message: RDD[String] = rdd.map { y => y._2 }
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import
Hi,
I need to compare the count of messages received, whether it is 0 or not, but
messages.count() returns a DStream[Long]. I tried this solution:
val cuenta = messages.count().foreachRDD { rdd =>
  rdd.first()
}
But
Yes.
Kind regards,
Michał Michalski,
michal.michal...@boxever.com
On 24 April 2015 at 17:12, Jeetendra Gangele gangele...@gmail.com wrote:
you used ZipWithUniqueID?
On 24 April 2015 at 21:28, Michal Michalski michal.michal...@boxever.com
wrote:
I somehow missed zipWithIndex (and Sean's
foreachRDD is an action and doesn't return anything. It seems like you
want one final count, but that's not possible with a stream, since
there is conceptually no end to a stream of data. You can get a stream
of counts, which is what you have already. You can sum those counts in
another data
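For example, a sketch of keeping a running total with an accumulator (messages is assumed to be the DStream from the original post):

// Each RDD produced by count() holds exactly one Long: that batch's count.
val totalCount = ssc.sparkContext.accumulator(0L)
messages.count().foreachRDD { rdd =>
  totalCount += rdd.first()
}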
FWIW, I ran into a similar issue on r3.8xlarge nodes and opted for
more/smaller executors. Another observation was that one large executor
results in less overall read throughput from S3 (using Amazon's EMRFS
implementation) in case that matters to your application.
-Sven
On Thu, Apr 23, 2015 at
The exception looks like the one mentioned in
https://issues.apache.org/jira/browse/SPARK-4520. What is the version of
Spark?
On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
My data looks like this:
+-----------+------------+----------+
|
To maintain the order you can use zipWithIndex as Sean Owen pointed out. This
is the same as zipWithUniqueId except the assigned number is the index of the
data in the RDD which I believe matches the order of data as it's stored on
HDFS.
Sent with Good (www.good.com)
-Original
On Fri, Apr 24, 2015 at 4:56 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
Thanks Dragos,
Earlier test shows spark.streaming.concurrentJobs has worked.
Glad to hear it worked!
iulian
Regards,
Laeeq
On Friday, April 24, 2015 11:58 AM, Iulian Dragoș
iulian.dra...@typesafe.com
I have a cluster launched with spark-ec2.
I can see a TachyonMaster process running,
but I do not seem to be able to use tachyon from the spark-shell.
if I try
rdd.saveAsTextFile("tachyon://localhost:19998/path")
I get
15/04/24 19:18:31 INFO TaskSetManager: Starting task 12.2 in stage 1.0 (TID
Hi Everyone,
Here's the Scala code for generating the EdgeRDD, VertexRDD, and Graph:
//Generate a mapping of vertex (edge) names to VertexIds
val vertexNameToIdRDD = rawEdgeRDD.flatMap(x =>
Seq(x._1.src, x._1.dst)).distinct.zipWithUniqueId.cache
//Generate VertexRDD with vertex data (in my case,
Hi,
I am wondering if there is any way to create a Row in SparkSQL 1.2 in Java
by using a List? It looks like
ArrayList<Object> something;
Row.create(something)
will create a row with single column (and the single column contains the
array)
Best,
Wenlei
I have an RDD of LabeledPoints.
Is it possible to select a subset of it based on a list of indices?
For example with idx=[0,4,5,6,8], I'd like to be able to create a new RDD with
elements 0,4,5,6 and 8.
Hi TD,
That little experiment helped a bit. This time we did not see any
exceptions for about 16 hours but eventually it did throw the same
exceptions as before. The cleaning of the shuffle files also stopped much
before these exceptions happened - about 7-1/2 hours after startup.
I am not quite
I am also facing the same problem with spark 1.3.0 and yarn-client and
yarn-cluster mode. Launching yarn container failed and this is the error in
stderr:
Container: container_1429709079342_65869_01_01
You should probably open a JIRA issue with this, I think.
Thanks
Best Regards
On Fri, Apr 24, 2015 at 3:27 AM, Daniel Mahler dmah...@gmail.com wrote:
Hi Akhil
I can confirm that the problem goes away when jsonRaw and jsonClean are in
different s3 buckets.
thanks
Daniel
On Thu, Apr 23,
Hi,
I would like to know if it is possible to build the DAG before actually
executing the application. My guess is that in the scheduler the DAG is
built dynamically at runtime since it might depend on the data, but I was
wondering if there is a way (and maybe a tool already) to analyze the code
I came across the feature in Spark where it allows you to schedule different
tasks within a Spark context. I want to implement this feature in a program
where I map my input RDD (from a text source) into a key-value RDD [K,V],
subsequently make a composite key-value RDD [(K1,K2),V], and a
aborted due to stage failure: Task 5 in
stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID
23, 10.253.1.117): ExecutorLostFailure (executor
20150424-104711-1375862026-5050-20113-S1 lost)
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache
I hit the same issue as if the directory has no files at all when running
the sample examples/src/main/python/streaming/hdfs_wordcount.py with a
local directory, and adding files into that directory. I'd appreciate comments
on how to resolve this.
The solution depends largely on your use case. I assume the index is in the
key. In that case, you can make a second RDD out of the list of indices and
then use cogroup() on both.
If the list of indices is small, just using filter() will work well.
If you need to read back a few select values to
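A sketch of the filter() route for a small index list; points stands for the RDD of LabeledPoints, and the indices are the ones from the original post:

val wanted = Set(0L, 4L, 5L, 6L, 8L)
val subset = points.zipWithIndex()
  .filter { case (_, idx) => wanted.contains(idx) }
  .map(_._1)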