Re: Matrix multiplication in spark

2014-11-05 Thread ll
@sowen.. I am looking for distributed operations, especially very large sparse matrix x sparse matrix multiplication. What is the best way to implement this in Spark?

Re: SQL COUNT DISTINCT

2014-11-05 Thread Bojan Kostic
Here is the link on jira: https://issues.apache.org/jira/browse/SPARK-4243

Re: Streaming window operations not producing output

2014-11-05 Thread sivarani
Hi TD, I would like to run streaming 24/7 and am trying to use getOrCreate, but it's not working. Can you please help with this? http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-getOrCreate-tc18060.html

Re: sparse x sparse matrix multiplication

2014-11-05 Thread Xiangrui Meng
local matrix-matrix multiplication or distributed? On Tue, Nov 4, 2014 at 11:58 PM, ll duy.huynh@gmail.com wrote: what is the best way to implement a sparse x sparse matrix multiplication with spark?

Re: Matrix multiplication in spark

2014-11-05 Thread Xiangrui Meng
We are working on distributed block matrices. The main JIRA is at: https://issues.apache.org/jira/browse/SPARK-3434 The goal is to support basic distributed linear algebra (dense first and then sparse). -Xiangrui On Wed, Nov 5, 2014 at 12:23 AM, ll duy.huynh@gmail.com wrote: @sowen.. i

Re: java.io.NotSerializableException: org.apache.spark.SparkEnv

2014-11-05 Thread sivarani
Hi Thanks for replying, I have posted my code in http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-getOrCreate-tc18060.html -- View this message in context:

add support for separate GC log files for different executor

2014-11-05 Thread haitao .yao
Hey, guys. Here's my problem: while using the standalone mode, I always use the following args for the executor: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc -Xloggc:/tmp/spark.executor.gc.log But as we know, the hotspot JVM does not support variable substitution in the -Xloggc parameter, which
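One workaround sometimes used for this (a sketch, not verified for every setup): pass a relative path to -Xloggc, which the JVM resolves against each executor's own working directory, so the log files from different executors do not collide.

    # conf/spark-defaults.conf -- a sketch; assumes the standalone worker gives each
    # executor its own working directory, which the JVM uses to resolve the relative path
    spark.executor.extraJavaOptions  -XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc -Xloggc:gc.log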

Re: sparse x sparse matrix multiplication

2014-11-05 Thread Duy Huynh
Distributed. Something like CoordinateMatrix.multiply(CoordinateMatrix). Thanks Xiangrui! On Wed, Nov 5, 2014 at 4:24 AM, Xiangrui Meng men...@gmail.com wrote: local matrix-matrix multiplication or distributed? On Tue, Nov 4, 2014 at 11:58 PM, ll duy.huynh@gmail.com wrote: what is

Re: Matrix multiplication in spark

2014-11-05 Thread Duy Huynh
OK, great. When will this be ready? On Wed, Nov 5, 2014 at 4:27 AM, Xiangrui Meng men...@gmail.com wrote: We are working on distributed block matrices. The main JIRA is at: https://issues.apache.org/jira/browse/SPARK-3434 The goal is to support basic distributed linear algebra, (dense first

Re: sparse x sparse matrix multiplication

2014-11-05 Thread Duy Huynh
In case this won't be available anytime soon with Spark, what would be a good way to implement this multiplication feature in Spark? On Wed, Nov 5, 2014 at 4:59 AM, Duy Huynh duy.huynh@gmail.com wrote: distributed. something like CoordinateMatrix.multiply(CoordinateMatrix). thanks

Dynamically InferSchema From Hive and Create parquet file

2014-11-05 Thread Jahagirdar, Madhu
Currently the createParquetFile method needs the bean class as one of the parameters: javahiveContext.createParquetFile(XBean.class, IMPALA_TABLE_LOC, true, new Configuration())

Change in the API for streamingcontext.actorStream?

2014-11-05 Thread Shiti Saxena
Has there been a change in creating an input stream with an actor receiver? I was able to get it working with v1.0.1 but not with any version after that. I tried doing so with EchoActor and get serialization errors. I have also reported an issue about this: SPARK-4171

Re: pass unique ID to mllib algorithms pyspark

2014-11-05 Thread Tamas Jambor
Hi Xiangrui, thanks for the reply. Is this still due to be released in 1.2 (SPARK-3530 is still open)? Thanks, On Wed, Nov 5, 2014 at 3:21 AM, Xiangrui Meng men...@gmail.com wrote: The proposed new set of APIs (SPARK-3573, SPARK-3530) will address this issue. We carry over extra columns with

using LogisticRegressionWithSGD.train in Python crashes with Broken pipe

2014-11-05 Thread rok
I have a dataset comprised of ~200k labeled points whose features are SparseVectors with ~20M features. I take 5% of the data for a training set. model = LogisticRegressionWithSGD.train(training_set) fails with ERROR:py4j.java_gateway:Error while sending or receiving. Traceback (most recent

Standalone Specify mem / cores defaults

2014-11-05 Thread Ashic Mahtab
Hi, the docs specify that we can control the amount of RAM / cores available via:
    -c CORES, --cores CORES    Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker
    -m MEM, --memory MEM       Total amount of memory to allow Spark applications to use on the

Re: Standalone Specify mem / cores defaults

2014-11-05 Thread Akhil Das
You can set those inside the spark-defaults.conf file under the conf directory inside your spark installation. Thanks Best Regards On Wed, Nov 5, 2014 at 4:51 PM, Ashic Mahtab as...@live.com wrote: Hi, The docs specify that we can control the amount of ram / cores available via: -c CORES,
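For completeness, a sketch of where these settings commonly live (property and variable names are standard, values are arbitrary examples): worker-wide defaults usually go in conf/spark-env.sh, while per-application defaults go in conf/spark-defaults.conf.

    # conf/spark-env.sh -- defaults for the standalone worker daemons (a sketch)
    SPARK_WORKER_CORES=8
    SPARK_WORKER_MEMORY=16g

    # conf/spark-defaults.conf -- defaults applied to each submitted application
    spark.cores.max        4
    spark.executor.memory  4g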

Unsubscribe

2014-11-05 Thread mrugen deshmukh
Thanks Regards, Mrugen Deshmukh. (M.S. Software Engineering - San Jose State University) http://www.linkedin.com/in/mrugendeshmukh

why decision trees do binary split?

2014-11-05 Thread jamborta
Hi, Just wondering what is the reason that the decision tree implementation in spark always does binary splits? thanks,

Re: I want to make clear the difference about executor-cores number.

2014-11-05 Thread jamborta
If you go to your Spark job UI (probably on http://master-node:4040) and click on the environment tab, you can check if the settings are correctly picked up by Spark. Also, when you run the job you can see the subtasks (stages tab); inside the task you can check what resources are assigned to the

Re: using LogisticRegressionWithSGD.train in Python crashes with Broken pipe

2014-11-05 Thread jamborta
Hi Rok, you could try to debug it by first collecting your training_set to see if it gets you something back before passing it to the train method. Then go through each line in the train method, and also the serializer, and check where exactly it fails. Thanks,

Re: Unsubscribe

2014-11-05 Thread Akhil Das
To unsubscribe, send an email to user-unsubscr...@spark.apache.org Read more over here https://spark.apache.org/community.html. Thanks Best Regards On Wed, Nov 5, 2014 at 6:03 PM, mrugen deshmukh mrugenm...@gmail.com wrote: Thanks Regards, Mrugen Deshmukh. (M.S. Software Engineering - San

Re: Streaming window operations not producing output

2014-11-05 Thread diogo
Nothing on log4j logs. I figured it out by comparing my code to the examples. On Wed, Nov 5, 2014 at 4:17 AM, sivarani whitefeathers...@gmail.com wrote: hi TD, I would like to run streaming 24/7 and trying to use get or create but its not working please can you help on this

Re: Spark on Yarn probably trying to load all the data to RAM

2014-11-05 Thread jan.zikes
I have tried merging the files into one, and Spark is now working with RAM as I expected. Unfortunately, after doing this another problem appears: Spark running on YARN is now scheduling all the work to only one worker node, as one big job. Is there some way to force Spark and

Re: Spark Streaming getOrCreate

2014-11-05 Thread Yana
Sivarani, does your spark master look like it's still up (i.e. if you check the UI)? I cannot tell if you see this error on get or on initial create. You can start debugging by dumping out the value of master in setMaster(master), especially if this failure is from the initial startup. From the

Re: Spark on Yarn probably trying to load all the data to RAM

2014-11-05 Thread jan.zikes
OK, so the problem is solved: the file was gzipped, and it looks like Spark does not support direct .gz file distribution to workers. Thank you very much for the suggestion to merge the files. Best regards, Jan

Re: using LogisticRegressionWithSGD.train in Python crashes with Broken pipe

2014-11-05 Thread rok
yes, the training set is fine, I've verified it.

Re: ERROR UserGroupInformation: PriviledgedActionException

2014-11-05 Thread Saiph Kappa
I am running the same version of Spark in the server (master + worker) and in the client / driver. For the server I am using the binaries spark-1.1.0-bin-hadoop1, and in the client I am using the same version:
    <dependency>
      <groupId>org.apache.spark</groupId>

Starting Spark Master on CDH5.2/Spark v1.1.0 fails. Indication is: 'SCALA_HOME is not set'

2014-11-05 Thread prismalytics
Hello Friends: I was temporarily using a manual build of Spark v1.1.0, until Cloudera CDH5 RPMs were updated to that latest version. So now I'm back to using the CDH5.2 Spark v1.1.0 distribution. That was just a preamble note for completeness. :) Now when I go to start the master as follows, it

Re: Snappy and spark 1.1

2014-11-05 Thread Aravind Srinivasan
Hi Guys, As part of debugging this native library error in our environment, it would be great if somebody can help me with this question. What kind of temp, scratch, and staging directories does Spark need and use on the slave nodes in the YARN cluster mode? Thanks, Aravind On Mon, Nov 3, 2014

Re: Spark on Yarn probably trying to load all the data to RAM

2014-11-05 Thread jan.zikes
Could you please give me an example, or send me a link, of how to use Hadoop's CombineFileInputFormat? It sounds very interesting to me and would probably save several hours of my pipeline computation. Merging the files is currently the bottleneck in my system.
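A minimal sketch of one way to do this (assumes a Hadoop 2 distribution that ships org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat; the path and split size are placeholders):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

    val hadoopConf = sc.hadoopConfiguration
    // upper bound on the size of one combined split, here 128 MB
    hadoopConf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)

    val lines = sc.newAPIHadoopFile(
        "hdfs:///path/to/small/files",
        classOf[CombineTextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        hadoopConf
      ).map(_._2.toString)   // copy out of the reused Text object right away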

Understanding spark operation pipeline and block storage

2014-11-05 Thread Hao Ren
Hi, I would like to understand the pipeline of Spark's operations (transformations and actions) and some details on block storage. Let's consider the following code: val rdd1 = sc.textFile("hdfs://...") ; rdd1.map(func1).map(func2).count. For example, we have a file in HDFS of about 80 GB,

Any limitations of spark.shuffle.spill?

2014-11-05 Thread Yangcheng Huang
Hi One question about the power of spark.shuffle.spill - (I know this has been asked several times :-) Basically, in handling a (cached) dataset that doesn't fit in memory, Spark can spill it to disk. However, can I say that, when this is enabled, Spark can handle the situation faultlessly,

Re: sparse x sparse matrix multiplication

2014-11-05 Thread Xiangrui Meng
You can use breeze for local sparse-sparse matrix multiplication, then define an RDD of sub-matrices, RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix), and then use join and aggregateByKey to implement this feature, which is the same as in MapReduce. -Xiangrui
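A rough sketch of that recipe (block ids and block sizes are assumed to line up; it uses join plus reduceByKey in place of aggregateByKey, and assumes breeze's sparse multiply/add for CSCMatrix):

    import breeze.linalg.CSCMatrix
    import org.apache.spark.rdd.RDD

    type Block = (Int, Int, CSCMatrix[Double])   // (blockRowId, blockColId, sub-matrix)

    def multiply(a: RDD[Block], b: RDD[Block]): RDD[Block] = {
      // key A by its column block and B by its row block so matching pairs meet in the join
      val aByCol = a.map { case (i, k, m) => (k, (i, m)) }
      val bByRow = b.map { case (k, j, m) => (k, (j, m)) }
      aByCol.join(bByRow)
        .map { case (_, ((i, ma), (j, mb))) => ((i, j), ma * mb) }   // per-pair block product
        .reduceByKey(_ + _)                                          // sum the partial products
        .map { case ((i, j), m) => (i, j, m) }
    }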

Re: using LogisticRegressionWithSGD.train in Python crashes with Broken pipe

2014-11-05 Thread Xiangrui Meng
Which Spark version did you use? Could you check the WebUI and attach the error message on executors? -Xiangrui On Wed, Nov 5, 2014 at 8:23 AM, rok rokros...@gmail.com wrote: yes, the training set is fine, I've verified it.

Re: Kafka Consumer in Spark Streaming

2014-11-05 Thread Something Something
As suggested by Qiaou, I looked at the UI:
1) Under 'Stages' the only 'active' stage is: runJob at ReceiverTracker.scala:275
2) Under 'Executors', there's only 1 active task, but I don't see any output (or logs)
3) Under 'Streaming', there's one receiver called 'KafkaReciever-0', but 'Records

Re: Dynamically InferSchema From Hive and Create parquet file

2014-11-05 Thread Michael Armbrust
That method is for creating a new directory to hold parquet data when there is no hive metastore available, thus you have to specify the schema. If you've already created the table in the metastore you can just query it using the sql method: javahiveContext.sql("SELECT * FROM parquetTable"); You

Re: Matrix multiplication in spark

2014-11-05 Thread Shivaram Venkataraman
We are working on a PRs to add block partitioned matrix formats and dense matrix multiply methods. This should be out in the next few weeks or so. The sparse methods still need some research on partitioning schemes etc. and we will do that after the dense methods are in place. Thanks Shivaram On

RE: Any Replicated RDD in Spark?

2014-11-05 Thread Shuai Zheng
Nice. Then I have another question: if I have a file (or a set of files: part-0, part-1, a few hundred MB of CSV up to 1-2 GB, created by another program) and need to create a hashtable from it, then later broadcast it to each node to allow queries (map-side join), I have two options to do it: 1, I can

RE: Any Replicated RDD in Spark?

2014-11-05 Thread Shuai Zheng
And another similar case: if I get an RDD from a previous step, but the next step should be a map-side join (so I need to broadcast this RDD to every node), what is the best way for me to do that? Collect the RDD in the driver first and create a broadcast? Or is there any shortcut in Spark for this? Thanks!

Logging from the Spark shell

2014-11-05 Thread Ulanov, Alexander
Dear Spark users, I would like to run a long experiment using spark-shell. How can I log my intermediate results (numbers, strings) into some file on a master node? What are the best practices? It is NOT performance metrics of Spark that I want to log every X seconds. Instead, I would like to
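One low-tech pattern (a sketch using plain JDK I/O rather than anything Spark-specific): append intermediate results to a file from the driver side of the shell, since spark-shell runs on the node where you launched it.

    import java.io.{File, FileWriter, PrintWriter}

    // appends one timestamped line per call; runs in the spark-shell (driver) JVM
    def logResult(msg: String, path: String = "/tmp/experiment.log"): Unit = {
      val out = new PrintWriter(new FileWriter(new File(path), true))   // true = append
      try out.println(s"${System.currentTimeMillis()} $msg") finally out.close()
    }

    logResult("finished step 1")   // usage example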

AVRO specific records

2014-11-05 Thread Simone Franzini
How can I read/write AVRO specific records? I found several snippets using generic records, but nothing with specific records so far. Thanks, Simone Franzini, PhD http://www.linkedin.com/in/simonefranzini

Partition sorting by Spark framework

2014-11-05 Thread nitinkak001
I need to sort my RDD partitions, but the whole partition(s) might not fit into memory, so I cannot use the Collections.sort() method. Does Spark support partition sorting by virtue of its framework? I am working on version 1.1.0. I looked up a similar unanswered question:

Re: AVRO specific records

2014-11-05 Thread Frank Austin Nothaft
Hi Simone, Matt Massie put together a good tutorial on his blog. If you’re looking for more code using Avro, we use it pretty extensively in our genomics project. Our Avro schemas are here, and we have serialization code here. We use Parquet for storing the Avro records, but there is also an

Re: AVRO specific records

2014-11-05 Thread Laird, Benjamin
Something like this works and is how I create an RDD of specific records. val avroRdd = sc.newAPIHadoopFile("twitter.avro", classOf[AvroKeyInputFormat[twitter_schema]], classOf[AvroKey[twitter_schema]], classOf[NullWritable], conf) (From
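A slightly fuller sketch of the same approach (twitter_schema stands for the Avro-generated specific record class and the path is a placeholder; assumes the avro-mapred artifact built for the new Hadoop API is on the classpath):

    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.NullWritable

    val conf = new Configuration()
    val avroRdd = sc.newAPIHadoopFile(
        "hdfs:///data/twitter.avro",
        classOf[AvroKeyInputFormat[twitter_schema]],
        classOf[AvroKey[twitter_schema]],
        classOf[NullWritable],
        conf)

    // unwrap the specific records from their AvroKey containers
    val tweets = avroRdd.map { case (k, _) => k.datum() }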

Question regarding sorting and grouping

2014-11-05 Thread Ping Tang
Hi, I'm working on a use case using Spark Streaming. I need to process an RDD of strings so that they are grouped by IP and sorted by time. Could somebody tell me the right transformation? Input:
    2014-10-23 08:18:38,904 [192.168.10.1]
    2014-10-23 08:18:38,907 [192.168.10.1] ccc
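One possible shape for this (a sketch; lines stands for the DStream or RDD of raw strings, and the regex plus sorting each IP's events in memory via groupByKey are assumptions that only hold if one IP's events fit in memory per batch):

    // matches lines like "2014-10-23 08:18:38,904 [192.168.10.1] some payload"
    val LinePattern = """^(\S+ \S+)\s+\[([^\]]+)\]\s*(.*)$""".r

    val byIp = lines.flatMap {
        case LinePattern(ts, ip, rest) => Some((ip, (ts, rest)))
        case _                         => None                  // drop unparsable lines
      }
      .groupByKey()                                             // all events of one IP together
      .mapValues(_.toSeq.sortBy(_._1))                          // this timestamp format sorts lexicographically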

cache function is not working on RDD from parallelize

2014-11-05 Thread Edwin
Hi, on a 5-node cluster, say I have data on the driver application node and I call parallelize on the data; I get an RDD back. However, when I call cache on the RDD, the RDD won't be cached (I checked by timing count on the supposedly cached RDD; it takes as long as before it was

Configuring custom input format

2014-11-05 Thread Corey Nolet
I'm trying to use a custom input format with SparkContext.newAPIHadoopRDD. Creating the new RDD works fine but setting up the configuration file via the static methods on input formats that require a Hadoop Job object is proving to be difficult. Trying to new up my own Job object with the

Re: Configuring custom input format

2014-11-05 Thread Corey Nolet
The closer I look at the stack trace in the Scala shell, the more it appears to be the call to toString() that is causing the construction of the Job object to fail. Is there a way to suppress this output, since it appears to be hindering my ability to new up this object? On Wed, Nov 5, 2014 at 5:49 PM,

[SQL] PERCENTILE is not working

2014-11-05 Thread Kevin Paul
Hi all, I encounter this error when executing the query sqlContext.sql("select percentile(age, array(0, 0.5, 1)) from people").collect(): java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be cast to [Ljava.lang.Object; at

Re: Breaking the previous large-scale sort record with Spark

2014-11-05 Thread Reynold Xin
Hi all, We are excited to announce that the benchmark entry has been reviewed by the Sort Benchmark committee and Spark has officially won the Daytona GraySort contest in sorting 100TB of data. Our entry tied with a UCSD research team building high performance systems and we jointly set a new

Re: with SparkStreeaming spark-submit, don't see output after ssc.start()

2014-11-05 Thread spr
This problem turned out to be a cockpit error. I had the same class name defined in a couple different files, and didn't realize SBT was compiling them all together, and then executing the wrong one. Mea culpa.

Re: Any Replicated RDD in Spark?

2014-11-05 Thread Matei Zaharia
If you start with an RDD, you do have to collect to the driver and broadcast to do this. Between the two options you listed, I think this one is simpler to implement, and there won't be a huge difference in performance, so you can go for it. Opening InputStreams to a distributed file system by
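A minimal sketch of that collect-then-broadcast pattern (the key/value types and RDD names are just for illustration):

    // smallRdd: RDD[(String, Int)] from the previous step, small enough for the driver
    val smallMap: Map[String, Int] = smallRdd.collect().toMap
    val smallBc = sc.broadcast(smallMap)

    // map-side join: the large RDD is never shuffled
    val joined = largeRdd.flatMap { case (k, v) =>
      smallBc.value.get(k).map(sv => (k, (v, sv)))
    }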

how to blend a DStream and a broadcast variable?

2014-11-05 Thread spr
My use case has one large data stream (DS1) that obviously maps to a DStream. The processing of DS1 involves filtering it for any of a set of known values, which will change over time, though slowly by streaming standards. If the filter data were static, it seems to obviously map to a broadcast

Re: AVRO specific records

2014-11-05 Thread Anand Iyer
You can also use the Kite SDK to read/write Avro records: https://github.com/kite-sdk/kite-examples/tree/master/spark - Anand On Wed, Nov 5, 2014 at 2:24 PM, Laird, Benjamin benjamin.la...@capitalone.com wrote: Something like this works and is how I create an RDD of specific records. val

How to trace/debug serialization?

2014-11-05 Thread ankits
In my spark job, I have a loop something like this:
    bla.foreachRDD(rdd => {
      // init some vars
      rdd.foreachPartition(partition => {
        // init some vars
        partition.foreach(kv => {
          ...
I am seeing serialization errors (unread block data), because I think spark is trying to serialize the

Re: Breaking the previous large-scale sort record with Spark

2014-11-05 Thread Matei Zaharia
Congrats to everyone who helped make this happen. And if anyone has even more machines they'd like us to run on next year, let us know :). Matei On Nov 5, 2014, at 3:11 PM, Reynold Xin r...@databricks.com wrote: Hi all, We are excited to announce that the benchmark entry has been

How to get Spark User List Digests only, but still be able to post questions ...

2014-11-05 Thread prismalytics
Hello Friends: I cringe to ask this off-topic question (so forgive me in advance). I'm trying to figure out how to receive only the digest email for this Spark User List, yet still be able to email questions to it. Subscribing to the 'user-dig...@spark.incubator.apache.org' alias does provide

SparkContext._lock Error

2014-11-05 Thread Pagliari, Roberto
I'm using this system: Hadoop 1.0.4, Scala 2.9.3, Hive 0.9.0, with Spark 1.1.0. When importing pyspark, I'm getting this error:
    >>> from pyspark.sql import *
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in ?
        from

Re: SparkContext._lock Error

2014-11-05 Thread Davies Liu
What's the version of Python? 2.4? Davies On Wed, Nov 5, 2014 at 4:21 PM, Pagliari, Roberto rpagli...@appcomsci.com wrote: I’m using this system Hadoop 1.0.4 Scala 2.9.3 Hive 0.9.0 With spark 1.1.0. When importing pyspark, I’m getting this error: from pyspark.sql import *

Spark SQL Hive Version

2014-11-05 Thread Cheng, Hao
Hi all, I noticed that when compiling Spark SQL with the profile hive-0.13.1, it fetches Hive version 0.13.1a under the groupId org.spark-project.hive. What's the difference from the one under org.apache.hive? And where can I get the source code for re-compiling? Thanks, Cheng Hao

RE: [SQL] PERCENTILE is not working

2014-11-05 Thread Cheng, Hao
Which version are you using? I can reproduce that in the latest code, but with a different exception. I've filed a bug, https://issues.apache.org/jira/browse/SPARK-4263; can you also add some information there? Thanks, Cheng Hao -Original Message- From: Kevin Paul

log4j logging control via sbt

2014-11-05 Thread Simon Hafner
I've tried to set the log4j logger to WARN only, via a log4j properties file:
    $ cat src/test/resources/log4j.properties
    log4j.logger.org.apache.spark=WARN
or in sbt via
    javaOptions += "-Dlog4j.logger.org.apache.spark=WARN"
But the logger still gives me INFO messages on stdout when I run my tests

Errors in Spark streaming application due to HDFS append

2014-11-05 Thread Ping Tang
Hi All, I’m trying to write streaming processed data in HDFS (Hadoop 2). The buffer is flushed and closed after each writing. The following errors occurred when opening the same file to append. I know for sure the error is caused by closing the file. Any idea? Here is the code to write HDFS

Re: How to avoid use snappy compression when saveAsSequenceFile?

2014-11-05 Thread Marcelo Vanzin
On Mon, Oct 27, 2014 at 7:37 PM, buring qyqb...@gmail.com wrote: Here is error log,I abstract as follows: INFO [binaryTest---main]: before first WARN [org.apache.spark.scheduler.TaskSetManager---Result resolver thread-0]: Lost task 0.0 in stage 0.0 (TID 0, spark-dev136):

Task size variation while using Range Vs List

2014-11-05 Thread nsareen
I noticed a behaviour where, if I'm using
    val temp = sc.parallelize(1 to 10)
    temp.collect
the task size will be small, let's say 1120 bytes. But if I change this to a for loop,
    import scala.collection.mutable.ArrayBuffer
    val data = new ArrayBuffer[Integer]()
    for (i <-

Re: How to trace/debug serialization?

2014-11-05 Thread nsareen
From what I've observed, there are no debug logs while serialization takes place. You can look at the source code if you want; the TaskSetManager class has some functions for serialization.

Re: [SQL] PERCENTILE is not working

2014-11-05 Thread Yin Huai
Hello Kevin, https://issues.apache.org/jira/browse/SPARK-3891 will fix this bug. Thanks, Yin On Wed, Nov 5, 2014 at 8:06 PM, Cheng, Hao hao.ch...@intel.com wrote: Which version are you using? I can reproduce that in the latest code, but with different exception. I've filed an bug

Re: Spark SQL Hive Version

2014-11-05 Thread Zhan Zhang
The original spark-project hive-0.13.1 had some packaging problems causing version conflicts, and hive-0.13.1a was repackaged to solve them. They share the same official Hive source code release, 0.13.1, with unnecessary packages removed from the original official Hive release package.

JavaStreamingContextFactory checkpoint directory NotSerializableException

2014-11-05 Thread Vasu C
Dear all, I am getting java.io.NotSerializableException for the code below. If jssc.checkpoint(HDFS_CHECKPOINT_DIR); is commented out, there is no exception. Please help.
    JavaStreamingContextFactory contextFactory = new JavaStreamingContextFactory() {
      @Override
      public JavaStreamingContext create() {

Re: Task size variation while using Range Vs List

2014-11-05 Thread Shixiong Zhu
For 2), if the input is a Range, Spark only needs the start and end values for each partition, so the overhead of a Range is small. But for an ArrayBuffer, Spark needs to serialize all of the data into the task. That's why it's huge in your case. For 1), Spark does not always traverse the data
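The difference is easy to see in a shell session (a sketch; the exact task sizes reported will vary):

    // Range: each task only carries its (start, end), so tasks stay tiny
    sc.parallelize(1 to 1000000).count()

    // ArrayBuffer: the whole collection is split up and serialized into the tasks
    import scala.collection.mutable.ArrayBuffer
    val data = new ArrayBuffer[Int]()
    for (i <- 1 to 1000000) data += i
    sc.parallelize(data).count()
    // compare the task sizes reported in the driver log / web UI for the two jobs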

Re: Any limitations of spark.shuffle.spill?

2014-11-05 Thread Shixiong Zhu
Two limitations we found here: http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-in-quot-cogroup-quot-td17349.html Best Regards, Shixiong Zhu 2014-11-06 2:04 GMT+08:00 Yangcheng Huang yangcheng.hu...@huawei.com: Hi One question about the power of spark.shuffle.spill – (I

Re: Submiting Spark application through code

2014-11-05 Thread sivarani
Thanks boss, it's working :)

Re: Spark Streaming: foreachRDD network output

2014-11-05 Thread sivarani
Anyone, any luck?

Re: log4j logging control via sbt

2014-11-05 Thread Akhil Das
How about adding the following in your $SPARK_HOME/conf/log4j.properties file?
    # Set WARN to be logged to the console
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err

Re: How to trace/debug serialization?

2014-11-05 Thread Shixiong Zhu
This is more about the mechanics of the Scala compiler and Java serialization. By default, Java will serialize an object deeply and recursively. Secondly, how the Scala compiler generates the byte code does matter. I'm not a Scala expert; here are just some observations: 1. If the function does not use any
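A small illustration of that point (hypothetical class; the usual workaround is to copy the needed field into a local val so only that value is captured):

    import org.apache.spark.rdd.RDD

    class Scorer(val weights: Map[String, Double]) {   // suppose Scorer itself is not serializable
      def score(rdd: RDD[String]): RDD[Double] = {
        // Referencing `weights` directly compiles to `this.weights`, so the closure
        // drags the whole Scorer instance into serialization:
        //   rdd.map(word => weights.getOrElse(word, 0.0))

        // Copying the field to a local val means only the Map is captured:
        val w = weights
        rdd.map(word => w.getOrElse(word, 0.0))
      }
    }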

Number cores split up

2014-11-05 Thread Naveen Kumar Pokala
Hi, I have a 2-node YARN cluster and I am using Spark 1.1.0 to submit my tasks. As per the Spark documentation, the number of cores is the maximum cores available. So does it mean each node creates (number of cores) = (number of threads) to process the job assigned to that node? For example, List<Integer>

Re: how to blend a DStream and a broadcast variable?

2014-11-05 Thread Sean Owen
Broadcast vars should work fine in Spark streaming. Broadcast vars are immutable however. If you have some info to cache which might change from batch to batch, you should be able to load it at the start of your 'foreachRDD' method or equivalent. That's simple and works assuming your batch
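A sketch of that pattern (ds1 is the main DStream of the original question; loadFilterValues is a hypothetical helper that re-reads the slowly changing set, e.g. from HDFS or a database):

    ds1.foreachRDD { rdd =>
      val sc = rdd.sparkContext
      val filterBc = sc.broadcast(loadFilterValues())            // fresh copy for this batch
      val matched = rdd.filter(x => filterBc.value.contains(x))  // assuming a DStream[String]
      matched.foreachPartition(_.foreach(println))               // placeholder output action
      filterBc.unpersist()                                       // release this batch's copy
    }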

Re: JavaStreamingContextFactory checkpoint directory NotSerializableException

2014-11-05 Thread Sean Owen
You didn't say what isn't serializable or where the exception occurs, but, is it the same as this issue? https://issues.apache.org/jira/browse/SPARK-4196 On Thu, Nov 6, 2014 at 5:42 AM, Vasu C vasuc.bigd...@gmail.com wrote: Dear All, I am getting java.io.NotSerializableException for below

Re: sparse x sparse matrix multiplication

2014-11-05 Thread Wei Tan
I think Xiangrui's ALS code implements certain aspects of it. You may want to check it out. Best regards, Wei - Wei Tan, PhD, Research Staff Member, IBM T. J. Watson Research Center

Unable to use HiveContext in spark-shell

2014-11-05 Thread tridib
I am connecting to a remote master using spark-shell, and I am getting the following error while trying to instantiate HiveContext:
    scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    error: bad symbolic reference. A signature in HiveContext.class refers to term hive in package