Of course, I could create a connection inside the map:
val result = rdd.map(line => {
  val conf = HBaseConfiguration.create()
  val connection = HConnectionManager.createConnection(conf)
  val table = connection.getTable("user")
  ...
  table.close()
  connection.close()
})
but that would be too slow, which is
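A common workaround (just a sketch, not from this thread; the table name and the
per-line work are placeholders) is to open one connection per partition with
mapPartitions instead of one per record:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HConnectionManager

val result = rdd.mapPartitions { iter =>
  val conf = HBaseConfiguration.create()
  val connection = HConnectionManager.createConnection(conf)
  val table = connection.getTable("user")
  // Force evaluation before closing, otherwise the lazy iterator would run
  // after the table and connection are already closed.
  val out = iter.map { line =>
    // ... look up / write `line` via `table` here ...
    line
  }.toList
  table.close()
  connection.close()
  out.iterator
}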
Hi Ankur,
If your rdds have common keys, you can look at partitioning both your
datasets using a custom partitioner based on keys so that you can avoid
shuffling and optimize join performance.
HTH
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
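A minimal sketch of that idea (rdd1/rdd2 and the partition count are
placeholders; HashPartitioner is just one choice of partitioner):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(100)

// Co-partition both key-value RDDs up front and cache them, so the join can
// reuse the partitioning instead of shuffling both sides.
val left  = rdd1.partitionBy(partitioner).cache()
val right = rdd2.partitionBy(partitioner).cache()

val joined = left.join(right)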
Found the root cause:
[SPARK-3372] [MLlib] MLlib doesn't pass maven build / checkstyle due to
multi-byte character contained in Gradient.scala
Author: Kousuke Saruta saru...@oss.nttdata.co.jp
Closes #2248 (https://github.com/apache/spark/pull/2248) from sarutak/SPARK-3372
and
This is how you deal with deduplicate errors:
libraryDependencies ++= Seq(
  ("org.apache.spark" % "spark-streaming_2.10" % "1.1.0" % "provided").
    exclude("org.eclipse.jetty.orbit", "javax.transaction").
    exclude("org.eclipse.jetty.orbit", "javax.mail").
    exclude("org.eclipse.jetty.orbit",
I'm trying to understand two things about how Spark is working.
(1) When I try to cache an RDD that fits well within memory (about 60g with
about 600g of memory), I get seemingly random levels of caching, from
around 60% to 100%, given the same tuning parameters. What governs how
much of an RDD
Oh, I forgot - I've set the following parameters at the moment (besides the
standard location, memory, and core setup):
spark.logConf true
spark.shuffle.consolidateFiles true
spark.ui.port 4042
spark.io.compression.codec
Hi,
I am using the comma-separated style to submit multiple jar files with the
following shell command, but it does not work:
bin/spark-submit --class org.apache.spark.examples.mllib.JavaKMeans
--master yarn-cluster --executor-memory 2g --jars
Dear all,
Is it possible to use any kind of object as a key in a PairedRDD? When I use
a case class key, the groupByKey operation doesn't behave as I expected. I
want to use a case class to avoid using a large tuple, as it is easier to
manipulate.
Cheers,
Jaonary
We use our custom classes, which are Serializable and have well-defined
hashCode and equals methods, through the Java API. What's the issue you are
getting?
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Fri, Oct 17, 2014 at 12:28 PM, Jaonary
Here is what I'm trying to do. My case class is the following:
case class PersonID(id: String, group: String, name: String)
I want to use PersonID as a key in a PairedRDD. But I think the default
equals function doesn't fit my need, because two PersonID("a", "a", "a") are
not treated as the same. When I use a tuple
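For reference, a minimal sketch of the setup being described (sc and the
sample data are placeholders); case classes generate value-based equals and
hashCode, so two PersonID("a", "a", "a") instances should compare equal:

val pairs = sc.parallelize(Seq(
  (PersonID("a", "a", "a"), 1),
  (PersonID("a", "a", "a"), 2)
))

// With a case class key, both records are expected to end up in one group.
val grouped = pairs.groupByKey()
grouped.collect()  // expected: a single key with both values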
Hello,
I would like to use areaUnderROC from MLlib in Apache Spark. I am currently
running Spark 1.1.0 and this function is not available in PySpark, but it is
available in Scala.
Is there a feature tracker that tracks the progress of porting Scala APIs
to Python APIs?
I have tried to search in
Hello MLnick,
Have you found a solution for installing MLlib on Mac OS? I also have some
trouble installing the dependencies.
Best,
poiuytrez
Dear all,
In my test program, there are 3 partitions for each RDD. The iteration
procedure is as follows:
var rdd_0 = ... // init
for (...) {
  rdd_1 = rdd_0.reduceByKey(...).partitionBy(p) // calculate rdd_1 from rdd_0
  rdd_0 = rdd_0.partitionBy(p).join(rdd_1)... //
Hi,
I have seen in the video from the Spark Summit that (when I use HDFS) data is
usually distributed across the whole cluster and computation usually goes to
the data.
My question is: how does it work when I read the data from Amazon S3? Is the
whole input dataset read by the master node
hi,
I have a cluster that has several nodes, and every node has several cores. I'd
like to run a multi-core algorithm within every map. So I'd like to ensure that
only one map task runs per cluster node. Is there some way to ensure this? It
seems to me that it should be possible
Hi all,
I have a Spark Streaming application running on a cluster, deployed with
the spark-submit script. I was reading here that it's possible to
gracefully shutdown the application in order to allow the deployment of a
new one:
Hi,
When I run the given Spark SQL commands in the exercise, they return
unexpected results. I'm explaining the results below for quick reference:
1. The output of the query wikiData.count() shows some count in the file.
2. After running the following query:
sqlContext.sql("SELECT username, COUNT(*)
Hi,
Probably I am missing a very simple principle, but something is wrong with
my filter; I get an org.apache.spark.SparkException: Task not serializable
exception.
Here is my filter function:
object OBJ {
  def f1(): Boolean = {
    var i = 1
    for (j <- 1 to 10) i = i + 1
    true
It might be because the object is nested within some class which may not be
serializable.
Also, you can run the application with this JVM parameter to see detailed
info about serialization: -Dsun.io.serialization.extendedDebugInfo=true
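For example, a sketch of wiring that flag in through SparkConf (the app name
is a placeholder; you could equally pass it on the JVM command line):

import org.apache.spark.SparkConf

// Ask executor JVMs to print extended serialization debug info, which shows
// the reference path to the non-serializable object.
val conf = new SparkConf()
  .setAppName("serialization-debug")
  .set("spark.executor.extraJavaOptions",
       "-Dsun.io.serialization.extendedDebugInfo=true")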
On Fri, Oct 17, 2014 at 4:07 PM, shahab
Hi all,
I need to compute a similarity between elements of two large sets of
high-dimensional feature vectors.
Naively, I create all possible pairs of vectors with
features1.cartesian(features2) and then map the resulting paired RDD with
my similarity function.
The problem is that the cartesian
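A minimal sketch of the naive approach described above (similarity is the
user-supplied black-box function; element types are left abstract):

// All |A| x |B| pairs are materialized, which is what makes this expensive.
val pairs = features1.cartesian(features2)
val sims  = pairs.map { case (a, b) => (a, b, similarity(a, b)) }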
Hi all,
I am running a standalone spark cluster with a single master. No HA or
failover is configured explicitly (no ZooKeeper etc).
What is the default designed behavior for submission of new jobs when a
single master went down or became unreachable?
I couldn't find it documented anywhere.
Hi,
I have been using Spark SQL with YARN.
It works fine in yarn-client mode, but with yarn-cluster mode we are
facing 2 issues. Is yarn-cluster mode not recommended for Spark SQL using
HiveContext?
Problem #1
We are not able to use any query with a very simple filtering operation
like,
Cartesian joins of large datasets are usually going to be slow. If there
is a way you can reduce the problem space to make sure you only join
subsets with each other, that may be helpful. Maybe if you explain your
problem in more detail, people on the list can come up with more
suggestions.
Best
Hi,
Thanks a lot for your reply. It is true that the Python API has default
parameters except for rank (the default number of iterations is 5). At the
very beginning, in order to estimate the speed of ALS.trainImplicit, I used
ALS.trainImplicit(ratings, rank, 1) and it worked. So I tried ALS with more
iterations,
Hi,
Today, I tried again with the following code, but it didn't work...
Could you please tell me your running environment?
from pyspark.mllib.recommendation import ALS
from pyspark import SparkContext
sc = SparkContext()
r1 = (1, 1, 1.0)
r2 = (1, 2, 2.0)
r3 = (2, 1, 2.0)
ratings =
For posterity's sake, I solved this. The problem was the Cloudera cluster
I was submitting to is running 1.0, and I was compiling against the latest
1.1 release. Downgrading to 1.0 on my compile got me past this.
On Tue, Oct 14, 2014 at 6:08 PM, Michael Campbell
michael.campb...@gmail.com
Hello,
My SBT pulls in, among others, the following dependency for Spark 1.1.0:
akka-actor_2.10-2.2.3-shaded-protobuf.jar
What is this? How is this different from the regular Akka Actor JAR? How do I
reconcile with other libs that use Akka, such as Play?
Thanks!
Best,
Issue was solved by clearing hashmap and hashset at the beginning of the call
method.
From: Jacob Maloney [mailto:jmalo...@conversantmedia.com]
Sent: Thursday, October 16, 2014 5:09 PM
To: user@spark.apache.org
Subject: Strange duplicates in data when scaling up
I have a flatmap function that
Hi Sonal
Thank you for the response, but since we are joining to reference data,
different partitions of application data would need to join with the same
reference data, and thus I am not sure if a Spark join would be a good fit
for this.
E.g. our application data has a person with a zip code, and then the
Looks like this data was encoded with an old version of Spark SQL. You'll
need to set the flag to interpret binary data as a string. More info on
configuration can be found here:
http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration
sqlContext.sql(set
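Presumably the flag in question is spark.sql.parquet.binaryAsString (see the
configuration section linked above); a minimal sketch:

// Interpret Parquet BINARY columns as strings, for files written by systems
// (including older Spark SQL) that did not distinguish binary from string.
sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")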
Hi,
I created an issue in JIRA.
https://issues.apache.org/jira/browse/SPARK-3990
I uploaded the error information in JIRA. Thanks in advance for your help.
Best
Gen
Davies Liu-2 wrote
It seems a bug, Could you create a JIRA for it? thanks!
Hello,
Maybe there is something I don't understand, but I believe this code should
not throw any serialization error when I run it in the spark shell.
Using similar code with map instead of mapPartitions works just fine.
import java.io.BufferedInputStream
import java.io.FileInputStream
Thank you for sharing this Cheng! This is fantastic. I was able to
implement it and it seems like it's working quite well. I'm definitely on
the right track now!
I'm still having a small problem with the rows inside each partition being
out of order - but I suspect this is because in the code
My goal is for rows to be partitioned according to timestamp bins (e.g.
with each partition representing an even interval of time), and then
ordered by timestamp *within* each partition. Ordering by user ID is not
important. In my aggregate function, in the seqOp function, I am checking
to verify
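For what it's worth, a sketch of that layout (the class and names are
hypothetical; keys are epoch-millisecond timestamps, bucketed into fixed-width
bins that wrap modulo the partition count, then sorted within each partition):

import org.apache.spark.Partitioner

class TimeBinPartitioner(binWidthMs: Long, bins: Int) extends Partitioner {
  override def numPartitions: Int = bins
  override def getPartition(key: Any): Int = key match {
    case ts: Long => ((ts / binWidthMs) % bins).toInt
  }
}

// events: RDD[(Long, Record)] keyed by timestamp
val binned = events
  .partitionBy(new TimeBinPartitioner(60000L, 24))
  .mapPartitions(iter => iter.toArray.sortBy(_._1).iterator,
                 preservesPartitioning = true)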
interesting. why does case class work for this? thanks boromir!
On Thu, Oct 16, 2014 at 10:41 PM, Boromir Widas vcsub...@gmail.com wrote:
make it a case class should work.
On Thu, Oct 16, 2014 at 8:30 PM, ll duy.huynh@gmail.com wrote:
i got an exception complaining about serializable.
I wasn't the original person who posted the question, but this helped me! :)
Thank you.
I had a similar issue today when I tried to connect using the IP address
(spark://master_ip:7077). I got it resolved by replacing it with the URL
displayed in the Spark web console - in my case it is
You will have to write something in your app like an endpoint or button
that triggers this code in your app.
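For reference, a minimal sketch of the stop call such a trigger would invoke
(ssc is your StreamingContext):

// Finish processing the data already received, then stop the streaming
// context and the underlying SparkContext.
ssc.stop(stopSparkContext = true, stopGracefully = true)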
Hi all,
I have a Spark Streaming application running on a cluster, deployed with
the spark-submit script. I was reading here that it's possible to
gracefully shutdown the application in
They should be the same except the package names are changed to avoid a
protobuf conflict. You can use it just like other Akka jars.
Chester
Sent from my iPhone
On Oct 17, 2014, at 5:56 AM, Ruebenacker, Oliver A
oliver.ruebenac...@altisource.com wrote:
Hello,
My SBT pulls in,
Hi,
How can I convert an RDD loaded from a Parquet file into its original type:
case class Person(name: String, age: Int)
val rdd: RDD[Person] = ...
rdd.saveAsParquetFile("people")
val rdd2 = sqlContext.parquetFile("people")
How can I map rdd2 back into an RDD[Person]? All of the examples just
There seems to be some problem with what gets captured in the closure
that's passed into the mapPartitions (myfunc in your case).
I've had a similar problem before:
http://apache-spark-user-list.1001560.n3.nabble.com/TaskNotSerializableException-when-running-through-Spark-shell-td16574.html
Try
Hi,
When trying to insert records into Hive, I got an error.
My Spark is 1.1.0 and Hive is 0.12.0.
Any idea what would be wrong?
Regards
Arthur
hive> CREATE TABLE students (name VARCHAR(64), age INT, gpa int);
OK
hive> INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1);
Hm, it works for me. Are you sure you have provided the right jars? What
happens if you pass in the `--verbose` flag?
2014-10-16 23:51 GMT-07:00 eric wong win19...@gmail.com:
Hi,
I am using the comma-separated style to submit multiple jar files with the
following shell command, but it does not work:
I believe I have a similar question to this. I would like to process an
offline event stream for testing/debugging. The stream is stored in a CSV
file where each row in the file has a timestamp. I would like to feed this
file into Spark Streaming and have the concept of time be driven by the
hello... is there a list that shows the complexity of each
action/transformation? for example, what is the complexity of
RDD.map()/filter() or RowMatrix.multiply() etc? that would be really
helpful.
thanks!
The RDDs aren't changing; you are assigning new RDDs to rdd_0 and
rdd_1. Operations like join and reduceByKey are making distinct, new
partitions that don't correspond 1-1 with old partitions anyway.
On Fri, Oct 17, 2014 at 5:32 AM, randylu randyl...@gmail.com wrote:
Dear all,
In my test
It doesn't contain commons math3 since Spark does not depend on it.
Its tests do, but tests are not built into the Spark assembly.
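So if your own code uses commons-math3, declare it explicitly in your build;
a sketch (the version shown is an assumption, use whichever you need):

libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"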
On Thu, Oct 16, 2014 at 9:57 PM, Henry Hung ythu...@winbond.com wrote:
HI All,
I try to build spark 1.1.0 using sbt with command:
sbt/sbt
On top of what Andrew said, you shouldn't need to manually add the
mllib jar to your jobs; it's already included in the Spark assembly
jar.
On Thu, Oct 16, 2014 at 11:51 PM, eric wong win19...@gmail.com wrote:
Hi,
I am using the comma-separated style to submit multiple jar files in the
follow
I'm not sure what the design is, but I think the current behavior if the
driver can't reach the master is to attempt to connect once and fail if
that attempt fails. Is that what you're observing? (What version of Spark
also?)
On Fri, Oct 17, 2014 at 3:51 AM, preeze etan...@gmail.com wrote:
Hi
On 10/17/2014 02:08 PM, ll wrote:
hello... is there a list that shows the complexity of each
action/transformation? for example, what is the complexity of
RDD.map()/filter() or RowMatrix.multiply() etc? that would be really
helpful.
thanks!
I'm pretty new to Spark, so I only know about
Hi,
If the Spark Thrift JDBC server is started in non-secure mode, it works
fine. With secure mode, in the case of pluggable authentication, I placed the
authentication class configuration in conf/hive-site.xml:
<property>
  <name>hive.server2.authentication</name>
  <value>CUSTOM</value>
</property>
Hi,
We have been implementing several Spark Streaming jobs that basically
process data and insert it into Cassandra, sorting it among different
keyspaces.
We've been following the pattern:
dstream.foreachRDD(rdd =>
  val records = rdd.map(elem => record(elem))
Hi Reza,
Thank you for the suggestion. The number of points is not that large, about
1000 for each set. So I have 1000x1000 pairs. But my similarity is
obtained using metric learning to rank, and from Spark it is viewed as a
black box. So my idea was just to distribute the computation of the
It may also cause a problem when running in the yarn-client mode. If
--driver-memory is large, Yarn has to allocate a lot of memory to the AM
container, but AM doesn't really need the memory.
Boduo
hello... i'm looking at the source code for mllib.linalg.Vectors and it looks
like it's a wrapper around Breeze with very small changes (mostly changing
the names).
i don't have any problem with using spark wrapper around Breeze or Breeze
directly. i'm just curious to understand why this wrapper
I don't know the answer for sure, but just from an API perspective I'd
guess that the Spark authors don't want to tie their API to Breeze. If at a
future point they swap out a different implementation for Breeze, they
don't have to change their public interface. MLlib's interface remains
https://gist.github.com/rjurney/fd5c0110fe7eb686afc9
Any way I try to join my data fails. I can't figure out what I'm doing
wrong.
--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
Yes, I think that's the logic, but then what does toBreezeVector return
if it is not based on Breeze? And this is called a lot by client code,
since you often have to do something nontrivial to the vector. I
suppose you can still have that thing return a Breeze vector and use
it for no other purpose.
Is it possible to disable input split if input is already small?
You can save to a local file. What are you trying and what doesn't work?
You can output one file by repartitioning to 1 partition but this is
probably not a good idea as you are bottlenecking the output and some
upstream computation by disabling parallelism.
How about just combining the files on
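If you do want a single file despite that caveat, a minimal sketch (the
output path is a placeholder):

// Collapse to one partition before writing, producing a single part file.
// All data then flows through one task, so keep this to small outputs.
rdd.coalesce(1).saveAsTextFile("/tmp/single-output")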
Hey Russell,
join() can only work with RDDs of pairs (key, value), such as
rdd1: (k, v1)
rdd2: (k, v2)
rdd1.join(rdd2) will be (k, (v1, v2))
Spark SQL will be more useful for you, see
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html
Davies
On Fri, Oct 17, 2014 at 5:01 PM,
Is that not exactly what I've done in j3/j4? The keys are identical
strings. The k is the same; the value in both instances is an associative
array.
devices = devices.map(lambda x: (dh.index('id'), {'deviceid':
x[dh.index('id')], 'foo': x[dh.index('foo')], 'bar':
x[dh.index('bar')]}))
bytes_in_out
Thanks, Andrew. What about reading out of local?
On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash and...@andrewash.com wrote:
When reading out of HDFS it's the HDFS block size.
On Fri, Oct 17, 2014 at 5:27 PM, Larry Liu larryli...@gmail.com wrote:
What is the default input split size? How to
There was a bug in the devices line: dh.index('id') should have been
x[dh.index('id')].
On Fri, Oct 17, 2014 at 5:52 PM, Russell Jurney russell.jur...@gmail.com
wrote:
Is that not exactly what I've done in j3/j4? The keys are identical
strings. The k is the same, the value in both instances