sparse x sparse matrix multiplication

2014-11-04 Thread ll
What is the best way to implement a sparse x sparse matrix multiplication with Spark?
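One common join-based approach, sketched below under the assumption that each sparse matrix is kept as a coordinate-list RDD of (row, col, value) entries (the names a, b, c are illustrative, not from the thread):

    // a, b: RDD[(Long, Long, Double)] holding only the non-zero entries of A and B
    val aByCol = a.map { case (i, k, v) => (k, (i, v)) }            // key A by its column index
    val bByRow = b.map { case (k, j, v) => (k, (j, v)) }            // key B by its row index
    val c = aByCol.join(bByRow)                                     // pairs sharing the inner index k
      .map { case (_, ((i, va), (j, vb))) => ((i, j), va * vb) }    // partial products
      .reduceByKey(_ + _)                                           // C(i, j) = sum over k of A(i, k) * B(k, j)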

Re: Best practice for join

2014-11-04 Thread Akhil Das
Oh, in that case, if you want to reduce the GC time, you can specify the level of parallelism along with your join, reduceByKey operations. Thanks Best Regards On Wed, Nov 5, 2014 at 1:11 PM, Benyi Wang wrote: > I'm using spark-1.0.0 in CDH 5.1.0. The big problem is SparkSQL doesn't > support H
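Both join and reduceByKey take an explicit partition count; a minimal sketch (rdd1, rdd2, pairs and the count of 200 are illustrative):

    // more, smaller shuffle partitions usually mean less GC pressure per task
    val joined  = rdd1.join(rdd2, 200)
    val reduced = pairs.reduceByKey(_ + _, 200)
    // or set a default for all shuffles: new SparkConf().set("spark.default.parallelism", "200")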

Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread haitao .yao
Done, JIRA link: https://issues.apache.org/jira/browse/SPARK-4241 Thanks. 2014-11-05 10:58 GMT+08:00 Nicholas Chammas : > Oh, I can see that region via boto as well. Perhaps the doc is indeed out > of date. > > Do you mind opening a JIRA issue >

Re: Best practice for join

2014-11-04 Thread Benyi Wang
I'm using spark-1.0.0 in CDH 5.1.0. The big problem is SparkSQL doesn't support Hash join in this version. On Tue, Nov 4, 2014 at 10:54 PM, Akhil Das wrote: > How about Using SparkSQL ? > > Thanks > Best Regards > > On Wed, Nov 5, 2014 at 1:53 AM, Benyi Wang wrote

Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Akhil Das
Your code doesn't trigger any action. How about the following? JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(60 * 1 * 1000)); JavaPairReceiverInputDStream tweets = KafkaUtils.createStream(ssc, ":2181", "1", map); JavaDStream statuses = tweets.map(

Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Something Something
It's not local. My spark url is something like this: String sparkUrl = "spark://:7077"; On Tue, Nov 4, 2014 at 11:03 PM, Jain Rahul wrote: > > I think you are running it locally. > Do you have local[1] here for master url? If yes change it to local[2] or > more number of threads. > I

Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Jain Rahul
I think you are running it locally. Do you have local[1] here for the master url? If yes, change it to local[2] or more threads. It may be due to a topic name mismatch also. sparkConf.setMaster("local[1]"); Regards, Rahul From: Something Something mailto:mailinglist...@gmail.com>> Date

Re: Best practice for join

2014-11-04 Thread Akhil Das
How about Using SparkSQL ? Thanks Best Regards On Wed, Nov 5, 2014 at 1:53 AM, Benyi Wang wrote: > I need to join RDD[A], RDD[B], and RDD[C]. Here is what I did, > > # build (K,V) from A and B to prepare the join > > val ja = A.map( r => (K1, Va)) > val jb = B.map

Re: GraphX and Spark

2014-11-04 Thread Kamal Banga
GraphX is built on *top* of Spark, so Spark can achieve whatever GraphX can. On Wed, Nov 5, 2014 at 9:41 AM, Deep Pradhan wrote: > Hi, > Can Spark achieve whatever GraphX can? > Keeping aside the performance comparison between Spark and GraphX, if I > want to implement any graph algorithm and I

Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Something Something
Added foreach as follows. Still don't see any output on my console. Would this go to the worker logs as Jerry indicated? JavaPairReceiverInputDStream tweets = KafkaUtils.createStream(ssc, ":2181", "1", map); JavaDStream statuses = tweets.map( new Function() {

How to increase hdfs read parallelism

2014-11-04 Thread Rajat Verma
Hi, I have a simple use case where I have to join two feeds. I have two worker nodes, each having 96 GB memory and 24 cores. I am running spark (1.1.0) with yarn (2.4.0). I have allocated 80% of resources to the spark queue and my spark config looks like spark.executor.cores=18 spark.executor.memory=66g spark.e

Re: save as JSON objects

2014-11-04 Thread Akhil Das
Something like this? val json = myRDD.map(*map_obj* => new JSONObject(*map_obj*)) Here map_obj will be a map containing values (eg: *Map("name" -> "Akhil", "mail" -> "xyz@xyz")*) Performance wasn't so good with this one though. Thanks Best Regards On Wed, Nov 5, 2014 at 3:02 AM, Yin Huai wr
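A minimal sketch of that idea using the JSONObject class from the Scala standard library (the RDD contents and output path are illustrative):

    import scala.util.parsing.json.JSONObject

    // each record is a Map[String, Any], e.g. Map("name" -> "Akhil", "mail" -> "xyz@xyz")
    val maps = sc.parallelize(Seq(Map("name" -> "Akhil", "mail" -> "xyz@xyz")))
    val json = maps.map(m => new JSONObject(m).toString())   // one JSON string per record
    json.saveAsTextFile("/tmp/output-json")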

Re: stackoverflow error

2014-11-04 Thread Sean Owen
With so many iterations, your RDD lineage is too deep. You should not need nearly so many iterations. 10 or 20 is usually plenty. On Tue, Nov 4, 2014 at 11:13 PM, Hongbin Liu wrote: > Hi, can you help with the following? We are new to spark. > > > > Error stack: > > > > 14/11/04 18:08:03 INFO Spa
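For ALS specifically, a minimal sketch with a modest iteration count (the parameter values are illustrative; ratings is assumed to be an RDD[Rating] built elsewhere):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // 10-20 iterations is usually enough and keeps the RDD lineage shallow
    val model = ALS.train(ratings, 50, 20, 0.065)   // rank = 50, iterations = 20, lambda = 0.065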

Re: Issue in Spark Streaming

2014-11-04 Thread Akhil Das
Which error are you referring here? Can you paste the error logs? Thanks Best Regards On Wed, Nov 5, 2014 at 11:04 AM, Suman S Patil wrote: > I am trying to run the Spark streaming program as given in the Spark > streaming Programming guide >

RE: Kafka Consumer in Spark Streaming

2014-11-04 Thread Shao, Saisai
If you’re running in standalone mode, the log is under the /work/ directory. I’m not sure about yarn or mesos; you can check the Spark documentation to see the details. Thanks Jerry From: Something Something [mailto:mailinglist...@gmail.com] Sent: Wednesday, November 05, 2014 2:28 PM To: Shao, Saisai

Re: ERROR UserGroupInformation: PriviledgedActionException

2014-11-04 Thread Akhil Das
It's more likely that you have different versions of Spark. Thanks Best Regards On Wed, Nov 5, 2014 at 3:05 AM, Saiph Kappa wrote: > I set the host and port of the driver and now the error slightly changed > > Using Spark's default log4j profile: >> org/apache/spark/log4j-defaults.properties >> 14

RE: MEMORY_ONLY_SER question

2014-11-04 Thread Shao, Saisai
From my understanding, the Spark code uses Kryo in a streaming manner for RDD partitions; deserialization happens as the iterator moves forward. But whether Kryo deserializes all the objects at once or incrementally is really a behavior of Kryo itself; I guess Kryo will not deserialize

Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Sean Owen
this code only expresses a transformation and so does not actually cause any action. I think you intend to use foreachRDD. On Wed, Nov 5, 2014 at 5:57 AM, Something Something wrote: > I've following code in my program. I don't get any error, but it's not > consuming the messages either. Shouldn
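A minimal sketch of what that change looks like (shown in Scala; the DStream name follows the thread, the rest is illustrative):

    // foreachRDD is an output operation, so it makes the streaming job actually run
    statuses.foreachRDD { rdd =>
      rdd.collect().foreach(println)   // collect() brings records to the driver, so they print on its console
    }
    ssc.start()
    ssc.awaitTermination()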

Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Something Something
The Kafka broker definitely has messages coming in. But your #2 point is valid. Needless to say I am a newbie to Spark. I can't figure out where the 'executor' logs would be. How would I find them? All I see printed on my screen is this: 14/11/04 22:21:23 INFO Slf4jLogger: Slf4jLogger started

RE: Kafka Consumer in Spark Streaming

2014-11-04 Thread Shao, Saisai
Hi, would you mind describing your problem a little more specifically? 1. Does the Kafka broker currently have no data feeding in? 2. This code will print the lines, but not on the driver side; the code runs on the executor side, so you can check the log in the worker dir to see if there’s a

Re: MEMORY_ONLY_SER question

2014-11-04 Thread Mohit Jaggi
I used the word "streaming" but I did not mean to refer to spark streaming. I meant if a partition containing 10 objects was kryo-serialized into a single buffer, then in a mapPartitions() call, as I call iter.next() 10 times to access these objects one at a time, does the deserialization happen a)
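The access pattern in question, as a minimal Scala sketch (records and the map body are illustrative):

    import org.apache.spark.storage.StorageLevel

    val cached = records.persist(StorageLevel.MEMORY_ONLY_SER)   // partitions kept as serialized byte buffers
    val lengths = cached.mapPartitions { iter =>
      // elements are deserialized as the iterator advances, not all materialized up front
      iter.map(record => record.toString.length)
    }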

Kafka Consumer in Spark Streaming

2014-11-04 Thread Something Something
I have the following code in my program. I don't get any errors, but it's not consuming the messages either. Shouldn't the following code print the line in the 'call' method? What am I missing? Please help. Thanks. JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration

Issue in Spark Streaming

2014-11-04 Thread Suman S Patil
I am trying to run the Spark streaming program as given in the Spark streaming Programming guide, in the interactive shell. I am getting an error as shown here as an intermediate step. It resumes the run on its own like th

Re: Spark Streaming getOrCreate

2014-11-04 Thread sivarani
Anybody any luck? I am also trying to set None to delete a key from state; will null help? How do I use Scala None in Java? My code goes this way: public static class ScalaLang { public static Option none() { return (Option) None$.MODULE$; } }

MLlib and PredictionIO sample code

2014-11-04 Thread Simon Chan
Hey guys, I have written a tutorial on deploying MLlib's models on production with open source PredictionIO: http://docs.prediction.io/0.8.1/templates/ The goal is to add the following features to MLlib, with production application in mind: - JSON query to retrieve prediction online - Separation-

Re: loading, querying schemaRDD using SparkSQL

2014-11-04 Thread vdiwakar.malladi
Thanks Michael for your response. Just now, I saw the saveAsTable method on the JavaSchemaRDD object (in the Spark 1.1.0 API), but I couldn't find the corresponding documentation. Will that help? Please let me know. Thanks in advance.

GraphX and Spark

2014-11-04 Thread Deep Pradhan
Hi, Can Spark achieve whatever GraphX can? Keeping aside the performance comparison between Spark and GraphX, if I want to implement any graph algorithm and I do not want to use GraphX, can I get the work done with Spark? Thank You

Re: pass unique ID to mllib algorithms pyspark

2014-11-04 Thread Xiangrui Meng
The proposed new set of APIs (SPARK-3573, SPARK-3530) will address this issue. We "carry over" extra columns through training and prediction and then leverage Spark SQL's execution plan optimization to decide which columns are really needed. For the current set of APIs, we can add `predictOnValues`
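With the current APIs, a common workaround is to keep the ID next to the feature vector and predict inside a map; a minimal sketch (data and model are illustrative):

    // data: RDD[(Long, Vector)] pairing each record's unique ID with its feature vector
    val predictions = data.map { case (id, features) => (id, model.predict(features)) }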

Re: stdout in spark applications

2014-11-04 Thread lokeshkumar
Got my answer from this thread, http://apache-spark-user-list.1001560.n3.nabble.com/no-stdout-output-from-worker-td2437.html

Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread Nicholas Chammas
Oh, I can see that region via boto as well. Perhaps the doc is indeed out of date. Do you mind opening a JIRA issue to track this request? I can do it if you've never opened a JIRA issue before. Nick On Tue, Nov 4, 2014 at 9:03 PM, haitao .y

Re: Spark v Redshift

2014-11-04 Thread Vladimir Rodionov
>> We service templated queries from the appserver, i.e. user fills >>out some forms, dropdowns: we translate to a query. and >>The target data >>size is about a billion records, 20'ish fields, distributed throughout a >>year (about 50GB on disk as CSV, uncompressed). tells me that proprietary i

Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread haitao .yao
I'm afraid not. We have been using EC2 instances in cn-north-1 region for a while. And the latest version of boto has added the region: cn-north-1 Here's the screenshot: from boto import ec2 >>> ec2.regions() [RegionInfo:us-east-1, RegionInfo:cn-north-1, RegionInfo:ap-northeast-1, RegionInfo

Re: Why mapred for the HadoopRDD?

2014-11-04 Thread raymond
You could take a look at sc.newAPIHadoopRDD() On November 5, 2014 at 9:29 AM, Corey Nolet wrote: > I'm fairly new to spark and I'm trying to kick the tires with a few > InputFormats. I noticed the sc.hadoopRDD() method takes a mapred JobConf > instead of a MapReduce Job object. Is there future planned suppo
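A minimal sketch of reading through a new-API (org.apache.hadoop.mapreduce) input format, using TextInputFormat and an illustrative path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val conf = new Configuration()
    conf.set("mapreduce.input.fileinputformat.inputdir", "/tmp/input")
    val rdd = sc.newAPIHadoopRDD(conf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])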

Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread Nicholas Chammas
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html cn-north-1 is not a supported region for EC2, as far as I can tell. There may be other AWS services that can use that region, but spark-ec2 relies on EC2. Nick On Tue, Nov 4, 2014 at 8:09 PM, haitao .yao wr

Re: Using SQL statements vs. SchemaRDD methods

2014-11-04 Thread Michael Armbrust
They both compile down to the same logical plans, so the performance of running the query should be the same. The Scala DSL uses a lot of Scala magic and thus is experimental, whereas HiveQL is pretty set in stone. On Tue, Nov 4, 2014 at 5:22 PM, SK wrote: > SchemaRDD supports some of the SQL-l
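For example, these two forms produce the same logical plan (table and column names are illustrative; the DSL form needs the implicits from import sqlContext._):

    // SQL statement
    val bySql = sqlContext.sql("SELECT name FROM people WHERE age > 21")
    // SchemaRDD DSL equivalent
    val byDsl = people.where('age > 21).select('name)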

Re: Spark v Redshift

2014-11-04 Thread agfung
Sounds like context would help, I just didn't want to subject people to a wall of text if it wasn't necessary :) Currently we use neither Spark SQL (or anything in the Hadoop stack) or Redshift. We service templated queries from the appserver, i.e. user fills out some forms, dropdowns: we transla

Why mapred for the HadoopRDD?

2014-11-04 Thread Corey Nolet
I'm fairly new to spark and I'm trying to kick the tires with a few InputFormats. I noticed the sc.hadoopRDD() method takes a mapred JobConf instead of a MapReduce Job object. Is there future planned support for the mapreduce packaging?

Re: netty on classpath when using spark-submit

2014-11-04 Thread Tobias Pfeiffer
Markus, thanks for your help! On Tue, Nov 4, 2014 at 8:33 PM, M. Dale wrote: > Tobias, >From http://spark.apache.org/docs/latest/configuration.html it seems > that there is an experimental property: > > spark.files.userClassPathFirst > Thank you very much, I didn't know about this. Unfor

Using SQL statements vs. SchemaRDD methods

2014-11-04 Thread SK
SchemaRDD supports some of the SQL-like functionality like groupBy(), distinct(), select(). However, SparkSQL also supports SQL statements which provide this functionality. In terms of future support and performance, is it better to use SQL statements or the SchemaRDD methods that provide equivale

spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread haitao .yao
Hi, Amazon aws started to provide service for China mainland, the region name is cn-north-1. But the script spark provides: spark_ec2.py will query ami id from https://github.com/mesos/spark-ec2/tree/v4/ami-list and there's no ami information for cn-north-1 region . Can anybody update the ami

Re: deploying a model built in mllib

2014-11-04 Thread Simon Chan
The latest version of PredictionIO, which is now under the Apache 2 license, supports the deployment of MLlib models in production. The "engine" you build will include a few components, such as: - Data - includes Data Source and Data Preparator - Algorithm(s) - Serving I believe that you can do the

RE: Workers not registering after master restart

2014-11-04 Thread Ashic Mahtab
Hi Nan, Cool. Thanks. Regards, Ashic. Date: Tue, 4 Nov 2014 18:26:48 -0500 From: zhunanmcg...@gmail.com To: as...@live.com CC: user@spark.apache.org Subject: Re: Workers not registering after master restart Hi, Ashic, this is expected for the latest released ve

Re: Spark v Redshift

2014-11-04 Thread Akshar Dave
There is no one-size-fits-all solution available in the market today. If somebody tells you they have one, they are simply lying :) Both solutions cater to different sets of problems. My recommendation is to put real focus on getting a better understanding of the problems that you are trying to solve w

Re: Spark v Redshift

2014-11-04 Thread Jimmy McErlain
This is pretty spot on, though I would also add that the Spark features it touts around speed all depend on caching the data in memory... reading off the disk still takes time, i.e. pulling the data into an RDD. This is the reason that Spark is great for ML... the data is used over an

Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
BTW while I haven't actually used Redshift, I've seen many companies that use both, usually using Spark for ETL and advanced analytics and Redshift for SQL on the cleaned / summarized data. Xiangrui Meng also wrote https://github.com/mengxr/redshift-input-format to make it easy to read data exp

Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
Is this about Spark SQL vs Redshift, or Spark in general? Spark in general provides a broader set of capabilities than Redshift because it has APIs in general-purpose languages (Java, Scala, Python) and libraries for things like machine learning and graph processing. For example, you might use S

Re: How to ship cython library to workers?

2014-11-04 Thread freedafeng
Thanks for the solution! I did figure out how to create an .egg file to ship out to the workers. Using ipython seems to be another cool solution.

RE: stackoverflow error

2014-11-04 Thread Hongbin Liu
Sorry, I have to change rank/lambda/iteration to the following val ranks = List(100, 200, 400) val lambdas = List(1.0, 2.0, 4.0) val numIters = List(50, 100, 150) From: Hongbin Liu Sent: Tuesday, November 04, 2014 6:14 PM To: 'user@spark.apache.org' Cc: Gregory Campbell Subject: stac

stackoverflow error

2014-11-04 Thread Hongbin Liu
Hi, can you help with the following? We are new to spark. Error stack: 14/11/04 18:08:03 INFO SparkContext: Job finished: count at ALS.scala:314, took 480.318100288 s Exception in thread "main" java.lang.StackOverflowError at org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.sc

Re: Workers not registering after master restart

2014-11-04 Thread Nan Zhu
Hi, Ashic, this is expected for the latest released version. However, workers should be able to re-register since 1.2, as this patch https://github.com/apache/spark/pull/2828 was merged. Best, -- Nan Zhu On Tuesday, November 4, 2014 at 6:00 PM, Ashic Mahtab wrote: > Hi, > I've set up a

Spark v Redshift

2014-11-04 Thread agfung
I'm in the midst of a heated debate about the use of Redshift v Spark with a colleague. We keep trading anecdotes and links back and forth (eg airbnb post from 2013 or amplab benchmarks), and we don't seem to be getting anywhere. So before we start down the prototype /benchmark road, and in desp

Workers not registering after master restart

2014-11-04 Thread Ashic Mahtab
Hi, I've set up a standalone Spark master (no failover or file recovery specified), and brought up a few worker nodes. All of them registered and were shown in the master web UI. I then stopped and started the master service (the workers were still running). After the master started up, I checked

Re: spark sql create nested schema

2014-11-04 Thread Yin Huai
Hello Tridib, For you case, you can use StructType(StructField("ParentInfo", parentInfo, true) :: StructField("ChildInfo", childInfo, true) :: Nil) to create the StructType representing the schema (parentInfo and childInfo are two existing StructTypes). You can take a look at our docs ( http://spa
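Putting that together, a minimal sketch of the nested schema discussed in this thread (only a few string fields shown):

    import org.apache.spark.sql._

    val parentInfo = StructType(
      StructField("ID", StringType, true) ::
      StructField("State", StringType, true) ::
      StructField("Zip", StringType, true) :: Nil)
    val childInfo = StructType(
      StructField("ID", StringType, true) :: Nil)
    val schema = StructType(
      StructField("ParentInfo", parentInfo, true) ::
      StructField("ChildInfo", childInfo, true) :: Nil)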

Re: Streaming window operations not producing output

2014-11-04 Thread Tathagata Das
Didn't you get any errors in the log4j logs saying that you have to enable checkpointing? TD On Tue, Nov 4, 2014 at 7:20 AM, diogo wrote: > So, to answer my own n00b question, if case anyone ever needs it. You have > to enable checkpointing (by ssc.checkpoint(hdfsPath)). Windowed > operations n

Re: How to make sure a ClassPath is always shipped to workers?

2014-11-04 Thread Peng Cheng
Thanks a lot! Unfortunately this is not my problem: The page class is already in the jar that is shipped to every worker. (I've logged into workers and unpacked the jar files, and see the class file right there as intended) Also, this error only happens sporadically, not every time. the error was s

Re: Model characterization

2014-11-04 Thread vinay453
Got it from a friend - println(model.weights) and println(model.intercept).

Re: scala RDD sortby compilation error

2014-11-04 Thread Sean Owen
That works for me in the shell, at least, without the streaming bit and whatever other code you had before this. Did you import scala.reflect.classTag? I think you'd get a different error if not. Maybe remove the "foreachFunc ="? On Tue, Nov 4, 2014 at 9:11 PM, Josh J wrote: > Please find my code

Re: ERROR UserGroupInformation: PriviledgedActionException

2014-11-04 Thread Saiph Kappa
I set the host and port of the driver and now the error slightly changed Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 14/11/04 21:13:48 INFO CoarseGrainedExecutorBackend: Registered signal > handlers for [TERM, HUP, INT] > 14/11/04 21:13:48 INFO SecurityManag

Re: save as JSON objects

2014-11-04 Thread Yin Huai
Hello Andrejs, For now, you need to use a JSON lib to serialize records of your datasets as JSON strings. In future, we will add a method to SchemaRDD to let you write a SchemaRDD in JSON format (I have created https://issues.apache.org/jira/browse/SPARK-4228 to track it). Thanks, Yin On Tue, N

Re: IllegalStateException: unread block data

2014-11-04 Thread freedafeng
Problem is solved. I basically built a fat Spark jar that includes all the HBase stuff and sent the examples.jar over to the slaves too.

Re: scala RDD sortby compilation error

2014-11-04 Thread Josh J
Please find my code here . On Tue, Nov 4, 2014 at 11:33 AM, Josh J wrote: > I'm using the same code >

Re: Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread Steve Reinhardt
From: Sean Owen >Maybe you are looking for updateStateByKey? >http://spark.apache.org/docs/latest/streaming-programming-guide.html#trans >formations-on-dstreams > >You can use broadcast to efficiently send info to all the workers, if >you have some other data that's immutable, like in a local fil

Re: MEMORY_ONLY_SER question

2014-11-04 Thread Tathagata Das
It is deserialized in a streaming manner as the iterator moves over the partition. This is a functionality of core Spark, and Spark Streaming just uses it as is. What do you want to customize it to? On Tue, Nov 4, 2014 at 9:22 AM, Mohit Jaggi wrote: > Folks, > If I have an RDD persisted in MEMOR

Best practice for join

2014-11-04 Thread Benyi Wang
I need to join RDD[A], RDD[B], and RDD[C]. Here is what I did, # build (K,V) from A and B to prepare the join val ja = A.map( r => (K1, Va)) val jb = B.map( r => (K1, Vb)) # join A, B val jab = ja.join(jb) # build (K,V) from the joined result of A and B to prepare joining with C val jc = C.ma

Re: StructField of StructType

2014-11-04 Thread Michael Armbrust
Structs are Rows nested in other rows. This might also be helpful: http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema On Tue, Nov 4, 2014 at 12:21 PM, tridib wrote: > How do I create a StructField of StructType? I need to create a nested > sche

StructField of StructType

2014-11-04 Thread tridib
How do I create a StructField of StructType? I need to create a nested schema.

spark sql create nested schema

2014-11-04 Thread tridib
I am trying to create a schema which will look like:
root
 |-- ParentInfo: struct (nullable = true)
 |    |-- ID: string (nullable = true)
 |    |-- State: string (nullable = true)
 |    |-- Zip: string (nullable = true)
 |-- ChildInfo: struct (nullable = true)
 |    |-- ID: string (nullable = tru

[ANN] Spark resources searchable

2014-11-04 Thread Otis Gospodnetic
Hi everyone, We've recently added indexing of all Spark resources to http://search-hadoop.com/spark . Everything is nicely searchable: * user & dev mailing lists * JIRA issues * web site * wiki * source code * javadoc. Maybe it's worth adding to http://spark.apache.org/community.html ? Enjoy!

Re: Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread Sean Owen
On Tue, Nov 4, 2014 at 8:02 PM, spr wrote: > To state this another way, it seems like there's no way to straddle the > streaming world and the non-streaming world; to get input from both a > (vanilla, Linux) file and a stream. Is that true? > > If so, it seems I need to turn my (vanilla file) da

Re: Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread Steve Reinhardt
-Original Message- From: Sean Owen >On Tue, Nov 4, 2014 at 8:02 PM, spr wrote: >> To state this another way, it seems like there's no way to straddle the >> streaming world and the non-streaming world; to get input from both a >> (vanilla, Linux) file and a stream. Is that true? >> >>

Re: Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread Sean Owen
Maybe you are looking for updateStateByKey? http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams You can use broadcast to efficiently send info to all the workers, if you have some other data that's immutable, like in a local file, that needs to be distr
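A minimal sketch of updateStateByKey keeping a running count per key (pairs is an illustrative DStream[(String, Int)]; checkpointing must be enabled for stateful operations):

    // called once per key per batch: newValues holds this batch's values, runningCount the prior state
    def updateCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
      Some(newValues.sum + runningCount.getOrElse(0))

    val counts = pairs.updateStateByKey(updateCount _)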

Re: Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread spr
Good, thanks for the clarification. It would be great if this were precisely stated somewhere in the docs. :) To state this another way, it seems like there's no way to straddle the streaming world and the non-streaming world; to get input from both a (vanilla, Linux) file and a stream. Is tha

Re: SparkSQL - No support for subqueries in 1.2-snapshot?

2014-11-04 Thread Terry Siu
Done. https://issues.apache.org/jira/browse/SPARK-4226 Hoping this will make it into 1.3? :) -Terry From: Michael Armbrust mailto:mich...@databricks.com>> Date: Tuesday, November 4, 2014 at 11:31 AM To: Terry Siu mailto:terry@smartfocus.com>> Cc: "user@spark.apache.org

Re: avro + parquet + vector + NullPointerException while reading

2014-11-04 Thread Michael Armbrust
You might consider using the native parquet support built into Spark SQL instead of using the raw library: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files On Mon, Nov 3, 2014 at 7:33 PM, Michael Albert < m_albert...@yahoo.com.invalid> wrote: > Greetings! > > > I'm tr

Re: Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread Sean Owen
Yes, code is just local Scala code unless it's invoking Spark APIs. The "non-Spark-streaming" block appears to just be normal program code executed in your driver, which ultimately starts the streaming machinery later. It executes once; there is nothing about that code connected to Spark. It's not

Re: loading, querying schemaRDD using SparkSQL

2014-11-04 Thread Michael Armbrust
Temporary tables are local to the context that creates them (just like RDDs). I'd recommend saving the data out as Parquet to share it between contexts. On Tue, Nov 4, 2014 at 3:18 AM, vdiwakar.malladi wrote: > Hi, > > There is a need in my application to query the loaded data into > sparkconte
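A minimal sketch of that pattern (the path is illustrative):

    // in the context that produced the data
    schemaRdd.saveAsParquetFile("/data/shared/people.parquet")
    // in any other SQLContext, later
    val people = sqlContext.parquetFile("/data/shared/people.parquet")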

Re: scala RDD sortby compilation error

2014-11-04 Thread Josh J
I'm using the same code, though I still receive: not enough arguments for method sortBy: (f: String => K, ascending: Boolean, numPartitions: Int)(implicit ord:

Re: Spark SQL takes unexpected time

2014-11-04 Thread Michael Armbrust
People also store data off-heap by putting parquet data into Tachyon. The optimization in 1.2 is to use the in-memory columnar cached format instead of keeping row objects (and their boxed contents) around when you call .cache(). This significantly reduces the number of live objects. (since you h

Fwd: Master example.MovielensALS

2014-11-04 Thread Debasish Das
Hi, I just built the master today and I was testing the IR metrics (MAP and prec@k) on Movielens data to establish a baseline... I am getting a weird error which I have not seen before: MASTER=spark://TUSCA09LMLVT00C.local:7077 ./bin/run-example mllib.MovieLensALS --kryo --lambda 0.065 hdfs://lo

Re: SparkSQL - No support for subqueries in 1.2-snapshot?

2014-11-04 Thread Michael Armbrust
This is not supported yet. It would be great if you could open a JIRA (though I think apache JIRA is down ATM). On Tue, Nov 4, 2014 at 9:40 AM, Terry Siu wrote: > I’m trying to execute a subquery inside an IN clause and am encountering > an unsupported language feature in the parser. > > java

scala RDD sortby compilation error

2014-11-04 Thread Josh J
Hi, Does anyone have any good examples of using sortBy on RDDs in Scala? I'm receiving: not enough arguments for method sortBy: (f: String => K, ascending: Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag: scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String]. Unspec
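For reference, only the key function is required; the other parameters have defaults. A minimal sketch with illustrative data:

    val lines = sc.parallelize(Seq("bb", "a", "ccc"))
    val shortestFirst = lines.sortBy(line => line.length)                                   // ascending by default
    val longestFirst  = lines.sortBy(line => line.length, ascending = false, numPartitions = 4)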

What's wrong with my settings about shuffle/storage.memoryFraction

2014-11-04 Thread Benyi Wang
I don't need to cache RDDs in my spark Application, but there is a big shuffle in the data processing. I can always find Shuffle spill (memory) and Shuffle spill (disk). I'm wondering if I can give more memory to shuffle to avoid spill to disk. export SPARK_JAVA_OPTS='-Dspark.shuffle.memoryFractio
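For reference, those fractions can also be set on the SparkConf; a minimal sketch (the values are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.memoryFraction", "0.5")   // more heap for shuffle aggregation buffers
      .set("spark.storage.memoryFraction", "0.3")   // less for the RDD cache, since nothing is cached here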

Re: Spark Streaming appears not to recognize a more recent version of an already-seen file; true?

2014-11-04 Thread spr
Holden Karau wrote > This is the expected behavior. Spark Streaming only reads new files once, > this is why they must be created through an atomic move so that Spark > doesn't accidentally read a partially written file. I'd recommend looking > at "Basic Sources" in the Spark Streaming guide ( > ht

Re: Spark Streaming appears not to recognize a more recent version of an already-seen file; true?

2014-11-04 Thread Holden Karau
This is the expected behavior. Spark Streaming only reads new files once, this is why they must be created through an atomic move so that Spark doesn't accidentally read a partially written file. I'd recommend looking at "Basic Sources" in the Spark Streaming guide ( http://spark.apache.org/docs/la

Spark Streaming appears not to recognize a more recent version of an already-seen file; true?

2014-11-04 Thread spr
I am trying to implement a use case that takes some human input. Putting that in a single file (as opposed to a collection of HDFS files) would be a simpler human interface, so I tried an experiment with whether Spark Streaming (via textFileStream) will recognize a new version of a filename it has

Re: Got java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s when running job from intellij Idea

2014-11-04 Thread Sean Owen
Hadoop is certainly bringing them back in. You should mark all Hadoop and Spark deps as "provided" to not even build them into your app. On Tue, Nov 4, 2014 at 4:49 PM, Jaonary Rabarisoa wrote: > I don't understand why since there's no javax.servlet in my build.sbt : > > > scalaVersion := "2.10.4

Re: java.io.NotSerializableException: org.apache.spark.SparkEnv

2014-11-04 Thread lordjoe
I posted on this issue in http://apache-spark-user-list.1001560.n3.nabble.com/How-to-access-objects-declared-and-initialized-outside-the-call-method-of-JavaRDD-td17094.html#a17150 Code starts public class SparkUtilities extends Serializable private transient static ThreadLocal threadContext;

Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread spr
The use case I'm working on has a main data stream in which a human needs to modify what to look for. I'm thinking to implement the main data stream with Spark Streaming and the things to look for with Spark. (Better approaches welcome.) To do this, I have intermixed Spark and Spark Streaming cod

SparkSQL - No support for subqueries in 1.2-snapshot?

2014-11-04 Thread Terry Siu
I’m trying to execute a subquery inside an IN clause and am encountering an unsupported language feature in the parser. java.lang.RuntimeException: Unsupported language features in query: select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (
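Until IN subqueries are supported, one workaround with a HiveContext (which parses HiveQL) is to rewrite the IN clause as a LEFT SEMI JOIN; a sketch against the thread's table name, with an illustrative inner filter:

    val result = hiveContext.sql(
      """SELECT a.customerid
        |FROM sparkbug a
        |LEFT SEMI JOIN (SELECT customerid FROM sparkbug WHERE customerid = 123) b
        |ON a.customerid = b.customerid""".stripMargin)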

MEMORY_ONLY_SER question

2014-11-04 Thread Mohit Jaggi
Folks, If I have an RDD persisted in MEMORY_ONLY_SER mode and then it is needed for a transformation/action later, is the whole partition of the RDD deserialized into Java objects first before my transform/action code works on it? Or is it deserialized in a streaming manner as the iterator moves ov

Re: Got java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s when running job from intellij Idea

2014-11-04 Thread Jaonary Rabarisoa
I don't understand why since there's no javax.servlet in my build.sbt:
    scalaVersion := "2.10.4"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.1.0",
      "org.apache.spark" %% "spark-sql" % "1.1.0",
      "org.apache.spark" %% "spark-mllib" % "1.1.0",
      "org.apache.hadoop" % "h
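Following Sean's suggestion, marking the Spark and Hadoop dependencies as "provided" keeps them (and their servlet-api transitives) out of the assembled application jar; a sketch (the hadoop artifact and versions are illustrative):

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-core"    % "1.1.0" % "provided",
      "org.apache.spark"  %% "spark-sql"     % "1.1.0" % "provided",
      "org.apache.spark"  %% "spark-mllib"   % "1.1.0" % "provided",
      "org.apache.hadoop"  % "hadoop-client" % "2.4.0" % "provided"
    )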

Re: with SparkStreeaming spark-submit, don't see output after ssc.start()

2014-11-04 Thread spr
Yes, good catch. I also realized, after I posted, that I was calling 2 different classes, though they are in the same JAR. I went back and tried it again with the same class in both cases, and it failed the same way. I thought perhaps having 2 classes in a JAR was an issue, but commenting out o

RE: Model characterization

2014-11-04 Thread Sameer Tilak
Excellent, many thanks. Really appreciate your help. Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone Original message From: Xiangrui Meng Date:11/03/2014 9:04 PM (GMT-08:00) To: Sameer Tilak Cc: user@spark.apache.org Subject: Re: Model characterization We r

Re: Streaming window operations not producing output

2014-11-04 Thread diogo
So, to answer my own n00b question, in case anyone ever needs it: you have to enable checkpointing (by ssc.checkpoint(hdfsPath)). Windowed operations need to be *checkpointed*, otherwise windows just won't work (and how could they). On Tue, Oct 28, 2014 at 10:24 AM, diogo wrote: > Hi there, I'm
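A minimal sketch of that (the durations, path, and source are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/windowed-app")          // required before using windowed operations
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))    // window length, slide interval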

Re: Cleaning/transforming json befor converting to SchemaRDD

2014-11-04 Thread Yin Huai
Hi Daniel, Right now, you need to do the transformation manually. The feature you need is under development (https://issues.apache.org/jira/browse/SPARK-4190). Thanks, Yin On Tue, Nov 4, 2014 at 2:44 AM, Gerard Maas wrote: > You could transform the json to a case class instead of serializing

Spark Streaming getOrCreate

2014-11-04 Thread sivarani
Hi All, I am using Spark Streaming. public class SparkStreaming{ SparkConf sparkConf = new SparkConf().setAppName("Sales"); JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(5000)); String chkPntDir = ""; //get checkpoint dir jssc.checkpoint(chkPntDir); JavaSpark
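For recovering from a checkpoint, the usual pattern is getOrCreate; a minimal Scala sketch (the Java API has an equivalent JavaStreamingContext.getOrCreate; the path and the empty graph-building step are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val chkPntDir = "hdfs:///checkpoints/sales"

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("Sales"), Seconds(5))
      ssc.checkpoint(chkPntDir)
      // ... build the DStream graph here ...
      ssc
    }

    // rebuilds the context from the checkpoint if one exists, otherwise calls createContext()
    val ssc = StreamingContext.getOrCreate(chkPntDir, createContext _)
    ssc.start()
    ssc.awaitTermination()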

Re: Spark SQL takes unexpected time

2014-11-04 Thread Corey Nolet
Michael, I should probably look closer myself @ the design of 1.2 vs 1.1 but I've been curious why Spark's in-memory data uses the heap instead of putting it off heap? Was this the optimization that was done in 1.2 to alleviate GC? On Mon, Nov 3, 2014 at 8:52 PM, Shailesh Birari wrote: > Yes, I

stdout in spark applications

2014-11-04 Thread lokeshkumar
Hi Forum, I am running a simple spark application with 1 master and 1 worker, submitting my application through spark-submit as a Java program. I have sysouts in the program, but I am not finding these sysouts in the stdout/stderr links in the master's web UI, nor in the SPARK_HOME/work directory. Pleas

Re: netty on classpath when using spark-submit

2014-11-04 Thread M. Dale
Tobias, From http://spark.apache.org/docs/latest/configuration.html it seems that there is an experimental property: spark.files.userClassPathFirst Whether to give user-added jars precedence over Spark's own jars when loading classes in Executors. This feature can be used to mitigate conf

Re: How to make sure a ClassPath is always shipped to workers?

2014-11-04 Thread Akhil Das
You can add your custom jar in the SPARK_CLASSPATH inside spark-env.sh file and restart the cluster to get it shipped on all the workers. Also you can use the .setJars option and add the jar while creating the sparkContext. Thanks Best Regards On Tue, Nov 4, 2014 at 8:12 AM, Peng Cheng wrote: >
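A minimal sketch of the setJars route (the jar path and app name are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyApp")
      .setJars(Seq("/path/to/my-app-assembly.jar"))   // shipped to the executors when the job is submitted
    val sc = new SparkContext(conf)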
