@sowen.. I am looking for distributed operations, especially very large
sparse matrix x sparse matrix multiplication. What is the best way to
implement this in Spark?
Here is the link on jira: https://issues.apache.org/jira/browse/SPARK-4243
Hi TD,
I would like to run streaming 24/7 and am trying to use getOrCreate, but it's
not working. Please can you help with this?
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-getOrCreate-tc18060.html
local matrix-matrix multiplication or distributed?
On Tue, Nov 4, 2014 at 11:58 PM, ll duy.huynh@gmail.com wrote:
what is the best way to implement a sparse x sparse matrix multiplication
with spark?
We are working on distributed block matrices. The main JIRA is at:
https://issues.apache.org/jira/browse/SPARK-3434
The goal is to support basic distributed linear algebra (dense first
and then sparse).
-Xiangrui
On Wed, Nov 5, 2014 at 12:23 AM, ll duy.huynh@gmail.com wrote:
@sowen.. i
Hi, thanks for replying.
I have posted my code at
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-getOrCreate-tc18060.html
Hey, guys. Here's my problem:
While using the standalone mode, I always use the following args for the
executor:
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc
-Xloggc:/tmp/spark.executor.gc.log
But as we know, the HotSpot JVM does not support variable substitution in the
-Xloggc parameter, which
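For reference, here is a minimal sketch of how these flags are commonly passed to executors; this assumes spark-defaults.conf is used, and the fixed -Xloggc path is exactly what collides when several executors write to the same location:
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc -Xloggc:/tmp/spark.executor.gc.log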
distributed. Something like CoordinateMatrix.multiply(CoordinateMatrix).
Thanks, Xiangrui!
On Wed, Nov 5, 2014 at 4:24 AM, Xiangrui Meng men...@gmail.com wrote:
local matrix-matrix multiplication or distributed?
On Tue, Nov 4, 2014 at 11:58 PM, ll duy.huynh@gmail.com wrote:
what is
Ok, great. When will this be ready?
On Wed, Nov 5, 2014 at 4:27 AM, Xiangrui Meng men...@gmail.com wrote:
We are working on distributed block matrices. The main JIRA is at:
https://issues.apache.org/jira/browse/SPARK-3434
The goal is to support basic distributed linear algebra, (dense first
In case this won't be available anytime soon in Spark, what would be a
good way to implement this multiplication feature in Spark?
On Wed, Nov 5, 2014 at 4:59 AM, Duy Huynh duy.huynh@gmail.com wrote:
distributed. Something like CoordinateMatrix.multiply(CoordinateMatrix).
thanks
Currently the createParquetFile method needs a bean class as one of the parameters.
javahiveContext.createParquetFile(XBean.class,
IMPALA_TABLE_LOC, true, new Configuration())
Has there been a change in creating an input stream with an actor receiver?
I was able to get it working with v1.0.1 but not with any other version
after that.
I tried doing so with EchoActor and got serialization errors. I have also
reported an issue about this: SPARK-4171.
Hi Xiangrui,
Thanks for the reply. Is this still due to be released in 1.2
(SPARK-3530 is still open)?
Thanks,
On Wed, Nov 5, 2014 at 3:21 AM, Xiangrui Meng men...@gmail.com wrote:
The proposed new set of APIs (SPARK-3573, SPARK-3530) will address
this issue. We carry over extra columns with
I have a dataset comprised of ~200k labeled points whose features are
SparseVectors with ~20M features. I take 5% of the data for a training set.
model = LogisticRegressionWithSGD.train(training_set)
fails with
ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent
Hi,
The docs specify that we can control the amount of ram / cores available via:
-c CORES, --cores CORES  Total CPU cores to allow Spark applications to use on
                         the machine (default: all available); only on worker
-m MEM, --memory MEM     Total amount of memory to allow Spark applications to use on the
You can set those inside the spark-defaults.conf file under the conf
directory inside your Spark installation.
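For example, a minimal sketch of what such entries can look like in conf/spark-defaults.conf (the property names are standard Spark configuration keys; the values are placeholders):
spark.cores.max        8
spark.executor.memory  4g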
Thanks
Best Regards
On Wed, Nov 5, 2014 at 4:51 PM, Ashic Mahtab as...@live.com wrote:
Hi,
The docs specify that we can control the amount of ram / cores available
via:
-c CORES,
Thanks Regards,
Mrugen Deshmukh.
(M.S. Software Engineering - San Jose State University)
http://www.linkedin.com/in/mrugendeshmukh
Hi,
Just wondering, what is the reason that the decision tree implementation in
Spark always does binary splits?
thanks,
If you go to your Spark job UI (probably on http://master-node:4040) and
click on the Environment tab, you can check if the settings are correctly
picked up by Spark.
Also, when you run the job, you can see the subtasks (Stages tab); inside a
task you can check what resources are assigned to the
Hi Rok,
you could try to debug it by first collecting your training_set, to see if it
gets you something back, before passing it to the train method. Then go
through each line in the train method, including the serializer, and check
where exactly it fails.
thanks,
To unsubscribe, send an email to user-unsubscr...@spark.apache.org.
Read more here: https://spark.apache.org/community.html.
Thanks
Best Regards
On Wed, Nov 5, 2014 at 6:03 PM, mrugen deshmukh mrugenm...@gmail.com
wrote:
Thanks Regards,
Mrugen Deshmukh.
(M.S. Software Engineering - San
Nothing in the log4j logs. I figured it out by comparing my code to the
examples.
On Wed, Nov 5, 2014 at 4:17 AM, sivarani whitefeathers...@gmail.com wrote:
Hi TD,
I would like to run streaming 24/7 and am trying to use getOrCreate, but it's
not working. Please can you help with this?
I have tried merging the files into one, and Spark is now working with RAM as
I expected.
Unfortunately, after doing this another problem appears. Now Spark running on
YARN is scheduling all the work to only one worker node, as one big job. Is
there some way to force Spark and
Sivarani, does your Spark master look like it's still up (i.e. if you check
the UI)?
I cannot tell if you see this error on get or on the initial create. You can
start debugging by dumping out the value of master in setMaster(master) --
especially if this failure is from the initial startup.
From the
Ok, so the problem was solved: the file was gzipped, and it looks like
Spark does not support distributing a .gz file directly to workers.
Thank you very much for the suggestion to merge the files.
Best regards,
Jan
I have
yes, the training set is fine, I've verified it.
I am running the same version of Spark on the server (master + worker) and
on the client / driver.
For the server I am using the binaries spark-1.1.0-bin-hadoop1.
And in the client I am using the same version:
<dependency>
  <groupId>org.apache.spark</groupId>
Hello Friends:
I was temporarily using a manual build of Spark v1.1.0, until Cloudera CDH5
RPMs were updated to that latest version. So now I'm back to using the
CDH5.2 Spark v1.1.0 distribution.
That was just a preamble note for completeness. :)
Now when I go to start the master as follows, it
Hi Guys,
As part of debugging this native library error in our environment, it
would be great if somebody can help me with this question. What kind of
temp, scratch, and staging directories does Spark need and use on the slave
nodes in the YARN cluster mode?
Thanks,
Aravind
On Mon, Nov 3, 2014
Could you please give me an example or send me a link on how to use Hadoop's
CombineFileInputFormat? It sounds very interesting to me and it would probably
save several hours of my pipeline computation. Merging the files is
currently the bottleneck in my system.
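Not a definitive answer, but here is a minimal sketch of what this can look like from Spark, assuming Hadoop 2.x, where CombineTextInputFormat packs many small files into fewer splits (the path and split size are placeholders):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

val conf = new Configuration()
// Cap each combined split at roughly 128 MB (a tuning value, not a recommendation).
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)
val lines = sc.newAPIHadoopFile("hdfs:///data/small-files",
    classOf[CombineTextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map(_._2.toString)  // keep just the text of each line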
Hi,
I would like to understand the pipeline of spark's operation(transformation
and action) and some details on block storage.
Let's consider the following code:
val rdd1 = sc.textFile("hdfs://...")
rdd1.map(func1).map(func2).count
For example, we have a file in HDFS of about 80 GB,
Hi
One question about the power of spark.shuffle.spill -
(I know this has been asked several times :-)
Basically, in handling a (cached) dataset that doesn't fit in memory, Spark can
spill it to disk.
However, can I say that, when this is enabled, Spark can handle the situation
faultlessly,
You can use breeze for local sparse-sparse matrix multiplication and
then define an RDD of sub-matrices
RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix)
and then use join and aggregateByKey to implement this feature, which
is the same as in MapReduce.
-Xiangrui
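To make the block approach above concrete, here is a minimal sketch (not a definitive implementation), assuming a breeze version where CSCMatrix supports * and + and that the block dimensions line up; reduceByKey is used instead of aggregateByKey for brevity:
import breeze.linalg.CSCMatrix
import org.apache.spark.SparkContext._  // pair-RDD operations on Spark 1.x
import org.apache.spark.rdd.RDD

// Block-partitioned sparse product C = A * B, where each element is
// (blockRowId, blockColId, localSparseBlock).
def multiply(a: RDD[(Int, Int, CSCMatrix[Double])],
             b: RDD[(Int, Int, CSCMatrix[Double])]): RDD[((Int, Int), CSCMatrix[Double])] = {
  val aByCol = a.map { case (i, k, m) => (k, (i, m)) }  // key A blocks by column-block id
  val bByRow = b.map { case (k, j, m) => (k, (j, m)) }  // key B blocks by row-block id
  aByCol.join(bByRow)                                   // pair up conformable blocks
    .map { case (_, ((i, ma), (j, mb))) => ((i, j), ma * mb) }  // local breeze multiply
    .reduceByKey(_ + _)                                 // sum the partial products per output block
}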
Which Spark version did you use? Could you check the WebUI and attach
the error message on executors? -Xiangrui
On Wed, Nov 5, 2014 at 8:23 AM, rok rokros...@gmail.com wrote:
yes, the training set is fine, I've verified it.
--
View this message in context:
As suggested by Qiaou, looked at the UI:
1) Under 'Stages' the only 'active' stage is: runJob at
ReceiverTracker.scala:275
2) Under 'Executors', there's only 1 active task, but I don't see any
output (or logs)
3) Under 'Streaming', there's one receiver called, 'KafkaReciever-0', but
'Records
That method is for creating a new directory to hold parquet data when there
is no hive metastore available, thus you have to specify the schema.
If you've already created the table in the metastore you can just query it
using the sql method:
javahiveContext.sql("SELECT * FROM parquetTable");
You
We are working on PRs to add block partitioned matrix formats and dense
matrix multiply methods. This should be out in the next few weeks or so.
The sparse methods still need some research on partitioning schemes etc.
and we will do that after the dense methods are in place.
Thanks
Shivaram
On
Nice.
Then I have another question: if I have a file (or a set of files: part-0,
part-1, maybe a few hundred MB of CSV up to 1-2 GB, created by another program)
and need to create a hashtable from it, and later broadcast it to each node to
allow querying (map-side join), I have two options to do it:
1, I can
And another similar case:
If I get an RDD from a previous step, but the next step should be a map-side
join (so I need to broadcast this RDD to every node), what is the best way for
me to do that? Collect the RDD on the driver first and create a broadcast? Or
is there any shortcut in Spark for this?
Thanks!
Dear Spark users,
I would like to run a long experiment using spark-shell. How can I log my
intermediate results (numbers, strings) into some file on a master node? What
are the best practices? It is NOT performance metrics of Spark that I want to
log every X seconds. Instead, I would like to
How can I read/write AVRO specific records?
I found several snippets using generic records, but nothing with specific
records so far.
Thanks,
Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini
I need to sort my RDD partitions, but a whole partition might not fit
into memory, so I cannot run the Collections.sort() method. Does Spark
support sorting within partitions as part of its framework? I am working on
version 1.1.0.
I looked up a similar unanswered question:
Hi Simone,
Matt Massie put together a good tutorial on his blog. If you’re looking for
more code using Avro, we use it pretty extensively in our genomics project. Our
Avro schemas are here, and we have serialization code here. We use Parquet for
storing the Avro records, but there is also an
Something like this works and is how I create an RDD of specific records.
val avroRdd = sc.newAPIHadoopFile("twitter.avro",
  classOf[AvroKeyInputFormat[twitter_schema]],
  classOf[AvroKey[twitter_schema]],
  classOf[NullWritable], conf) (From
Hi,
I'm working on a use case using Spark Streaming. I need to process an RDD of
strings so that they will be grouped by IP and sorted by time. Could somebody
tell me the right transformation?
Input:
2014-10-23 08:18:38,904 [192.168.10.1]
2014-10-23 08:18:38,907 [192.168.10.1] ccc
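Not an authoritative answer, but one minimal sketch of the transformation, assuming lines is a DStream[String] shaped like the input above and that the timestamp prefix sorts correctly as a string:
import org.apache.spark.streaming.StreamingContext._  // pair-DStream operations on Spark 1.x

// Key each line by the IP inside [...], group per batch, then sort each group by timestamp.
val byIp = lines.map { line =>
  val ip = line.substring(line.indexOf('[') + 1, line.indexOf(']'))
  (ip, line)
}
val groupedSorted = byIp.groupByKey()
  .mapValues(_.toSeq.sortBy(_.take(23)))  // "yyyy-MM-dd HH:mm:ss,SSS" prefix sorts lexicographically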
Hi,
On a 5-node cluster, say I have data on the driver application node,
and then I call parallelize on the data; I get an RDD back.
However, when I call cache on the RDD, the RDD won't be cached (I checked
that through timing: count on the supposedly cached RDD takes as long as before
it was
I'm trying to use a custom input format with SparkContext.newAPIHadoopRDD.
Creating the new RDD works fine but setting up the configuration file via
the static methods on input formats that require a Hadoop Job object is
proving to be difficult.
Trying to new up my own Job object with the
The closer I look at the stack trace in the Scala shell, the more it appears
to be the call to toString() that is causing the construction of the Job object
to fail. Is there a way to suppress this output, since it appears to be
hindering my ability to new up this object?
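For what it's worth, a common workaround is to build the Job only to reach the static setters and then hand its Configuration to newAPIHadoopRDD; a sketch, with MyInputFormat, MyKey and MyValue as hypothetical placeholders for your classes:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

val job = Job.getInstance(new Configuration())         // on Hadoop 1.x: new Job(new Configuration())
FileInputFormat.setInputPaths(job, "hdfs:///data/in")  // static setter that wants a Job
val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[MyInputFormat], classOf[MyKey], classOf[MyValue])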
On Wed, Nov 5, 2014 at 5:49 PM,
Hi all, I encounter this error when executing the query
sqlContext.sql("select percentile(age, array(0, 0.5, 1)) from people").collect()
java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer
cannot be cast to [Ljava.lang.Object;
at
Hi all,
We are excited to announce that the benchmark entry has been reviewed by
the Sort Benchmark committee and Spark has officially won the Daytona
GraySort contest in sorting 100TB of data.
Our entry tied with a UCSD research team building high performance systems
and we jointly set a new
This problem turned out to be a cockpit error. I had the same class name
defined in a couple different files, and didn't realize SBT was compiling
them all together, and then executing the wrong one. Mea culpa.
If you start with an RDD, you do have to collect it to the driver and broadcast
it to do this. Between the two options you listed, I think this one is simpler
to implement, and there won't be a huge difference in performance, so you can go
for it. Opening InputStreams to a distributed file system by
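A minimal sketch of that option, assuming smallRdd is a pair RDD small enough to fit on the driver and largeRdd is the big side of the join:
val smallMap = smallRdd.collectAsMap()           // pull the small side to the driver
val bcast = sc.broadcast(smallMap)               // ship it once to every executor
val joined = largeRdd.flatMap { case (k, v) =>
  bcast.value.get(k).map(w => (k, (v, w)))       // keep only keys present on both sides
}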
My use case has one large data stream (DS1) that obviously maps to a DStream.
The processing of DS1 involves filtering it for any of a set of known
values, which will change over time, though slowly by streaming standards.
If the filter data were static, it seems to obviously map to a broadcast
You can also use the Kite SDK to read/write Avro records:
https://github.com/kite-sdk/kite-examples/tree/master/spark
- Anand
On Wed, Nov 5, 2014 at 2:24 PM, Laird, Benjamin
benjamin.la...@capitalone.com wrote:
Something like this works and is how I create an RDD of specific records.
val
In my Spark job, I have a loop something like this:
bla.foreachRDD(rdd => {
  // init some vars
  rdd.foreachPartition(partition => {
    // init some vars
    partition.foreach(kv => {
      ...
I am seeing serialization errors (unread block data), because I think Spark
is trying to serialize the
Congrats to everyone who helped make this happen. And if anyone has even more
machines they'd like us to run on next year, let us know :).
Matei
On Nov 5, 2014, at 3:11 PM, Reynold Xin r...@databricks.com wrote:
Hi all,
We are excited to announce that the benchmark entry has been
Hello Friends:
I cringe to ask this off-topic question (so forgive me in advance).
I'm trying to figure out how to receive only the digest email for this Spark
User List, yet
still be able to email questions to it.
Subscribing to the 'user-dig...@spark.incubator.apache.org' alias does
provide
I'm using this system
Hadoop 1.0.4
Scala 2.9.3
Hive 0.9.0
With spark 1.1.0. When importing pyspark, I'm getting this error:
from pyspark.sql import *
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in ?
    from
What's the version of Python? 2.4?
Davies
On Wed, Nov 5, 2014 at 4:21 PM, Pagliari, Roberto
rpagli...@appcomsci.com wrote:
I’m using this system
Hadoop 1.0.4
Scala 2.9.3
Hive 0.9.0
With spark 1.1.0. When importing pyspark, I’m getting this error:
from pyspark.sql import *
Hi all, I noticed that when compiling Spark SQL with the hive-0.13.1 profile,
it fetches Hive version 0.13.1a under groupId
org.spark-project.hive. What's the difference from the one under
org.apache.hive? And where can I get the source code for re-compiling?
Thanks,
Cheng Hao
Which version are you using? I can reproduce that in the latest code, but with
a different exception.
I've filed a bug, https://issues.apache.org/jira/browse/SPARK-4263; can you
also add some information there?
Thanks,
Cheng Hao
-Original Message-
From: Kevin Paul
I've tried to set the log4j logger to WARN only, via a log4j properties file:
cat src/test/resources/log4j.properties
log4j.logger.org.apache.spark=WARN
or in sbt via
javaOptions += "-Dlog4j.logger.org.apache.spark=WARN"
But the logger still gives me INFO messages on stdout when I run my tests.
Hi All,
I'm trying to write streaming processed data to HDFS (Hadoop 2). The buffer is
flushed and closed after each write. The following errors occurred when
opening the same file to append. I know for sure the error is caused by closing
the file. Any idea?
Here is the code that writes to HDFS:
On Mon, Oct 27, 2014 at 7:37 PM, buring qyqb...@gmail.com wrote:
Here is error log,I abstract as follows:
INFO [binaryTest---main]: before first
WARN [org.apache.spark.scheduler.TaskSetManager---Result resolver
thread-0]: Lost task 0.0 in stage 0.0 (TID 0, spark-dev136):
I noticed a behaviour where, if I'm using
val temp = sc.parallelize(1 to 10)
temp.collect
the task size will be some number of bytes, let's say 1120 bytes.
But if I change this to a for loop:
import scala.collection.mutable.ArrayBuffer
val data = new ArrayBuffer[Integer]()
for (i <-
From what I've observed, there are no debug logs while serialization takes
place. You can see the source code if you want; the TaskSetManager class has
some functions for serialization.
Hello Kevin,
https://issues.apache.org/jira/browse/SPARK-3891 will fix this bug.
Thanks,
Yin
On Wed, Nov 5, 2014 at 8:06 PM, Cheng, Hao hao.ch...@intel.com wrote:
Which version are you using? I can reproduce that in the latest code, but
with different exception.
I've filed an bug
The original spark-project hive-0.13.1 had some packaging problems that caused
version conflicts, and hive-0.13.1a is a repackaging that solves them. They
share the same official Hive 0.13.1 source code release, with the unnecessary
packages removed from the original official Hive release package.
Dear All,
I am getting java.io.NotSerializableException for the code below. If the
jssc.checkpoint(HDFS_CHECKPOINT_DIR); call is left out, there is no exception.
Please help.
JavaStreamingContextFactory contextFactory = new
JavaStreamingContextFactory() {
  @Override
  public JavaStreamingContext create() {
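For reference, a minimal Scala sketch of the same checkpoint/getOrCreate pattern (the app name and paths are placeholders); with checkpointing enabled, everything referenced while building the DStreams inside the factory must be serializable:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///tmp/checkpoints")
  // build DStreams here
  ssc
}
val ssc = StreamingContext.getOrCreate("hdfs:///tmp/checkpoints", createContext _)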
For 2), if the input is a Range, Spark only needs the start value and the end
value for each partition, so the overhead of a Range is small. But
for an ArrayBuffer, Spark needs to serialize all of the data into the task.
That's why it's huge in your case.
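A tiny illustration of that difference (a sketch; the exact task sizes shown in the UI will vary):
val r = sc.parallelize(1 to 1000000)        // Range: each task only carries its (start, end)
val buf = new scala.collection.mutable.ArrayBuffer[Int]()
for (i <- 1 to 1000000) buf += i
val a = sc.parallelize(buf)                 // ArrayBuffer: each task carries its whole slice of elements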
For 1), Spark does not always travel the data
Two limitations we found here:
http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-in-quot-cogroup-quot-td17349.html
Best Regards,
Shixiong Zhu
2014-11-06 2:04 GMT+08:00 Yangcheng Huang yangcheng.hu...@huawei.com:
Hi
One question about the power of spark.shuffle.spill –
(I
Thanks boss, it's working :)
Anyone, any luck?
How about adding the following in your $SPARK_HOME/conf/log4j.properties
file?
# Set WARN to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
This is more about the mechanics of the Scala compiler and Java serialization.
By default, Java will serialize an object deeply and recursively.
Secondly, how the Scala compiler generates the byte code matters. I'm not
a Scala expert; here is just some observation:
1. If the function does not use any
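In practice, the pattern that keeps such objects out of the task closure is to create them inside foreachPartition; a sketch, where createConnection stands in for any hypothetical non-serializable helper:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val conn = createConnection()                    // built on the executor, never serialized with the task
    partition.foreach(record => conn.send(record))   // hypothetical per-record work
    conn.close()
  }
}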
Hi,
I have a 2-node YARN cluster and I am using Spark 1.1.0 to submit my tasks.
As per the Spark documentation, the number of cores is the maximum cores available.
So does that mean each node creates no. of cores = no. of threads to process the
job assigned to that node?
For example,
List<Integer>
Broadcast vars should work fine in Spark Streaming. Broadcast vars are
immutable, however. If you have some info to cache which might change
from batch to batch, you should be able to load it at the start of
your 'foreachRDD' method or equivalent. That's simple and works
assuming your batch
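A minimal sketch of that suggestion, where loadCurrentFilterSet is a hypothetical driver-side loader for the slowly changing values:
dstream.foreachRDD { rdd =>
  val current = rdd.sparkContext.broadcast(loadCurrentFilterSet())  // re-read at the start of each batch
  val matched = rdd.filter(x => current.value.contains(x))
  // ... process matched ...
}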
You didn't say what isn't serializable or where the exception occurs,
but, is it the same as this issue?
https://issues.apache.org/jira/browse/SPARK-4196
On Thu, Nov 6, 2014 at 5:42 AM, Vasu C vasuc.bigd...@gmail.com wrote:
Dear All,
I am getting java.io.NotSerializableException for below
I think Xiangrui's ALS code implements certain aspects of it. You may want to
check it out.
Best regards,
Wei
-
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
From: Xiangrui Meng men...@gmail.com
To: Duy Huynh duy.huynh@gmail.com
I am connecting to a remote master using the Spark shell. Then I am getting the
following error while trying to instantiate HiveContext.
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
error: bad symbolic reference. A signature in HiveContext.class refers to
term hive
in package