Re: Spark job tracker.

2014-07-21 Thread abhiguruvayya
Hello Marcelo Vanzin, Can you explain a bit more on this? I tried using client mode, but can you explain how I can use this port to write the log or output to it? Thanks in advance!

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-21 Thread Victor Sheng
Hi Kevin, I tried it on Spark 1.0.0 and it works fine. It's a bug in Spark 1.0.1 ... Thanks, Victor

Re: TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: id#0 on GROUP BY

2014-07-21 Thread Martin Gammelsæter
Aha, that makes sense. Thanks for the response! I guess error messages are one of the areas Spark could use some love in (: On Fri, Jul 18, 2014 at 9:41 PM, Michael Armbrust mich...@databricks.com wrote: Sorry for the non-obvious error message. It is not valid SQL to include attributes in the

Re: spark sql left join gives KryoException: Buffer overflow

2014-07-21 Thread Pei-Lun Lee
Hi Michael, Thanks for the suggestion. In my query, both tables are too large to use a broadcast join. When SPARK-2211 is done, will Spark SQL automatically choose join algorithms? Is there some way to manually hint the optimizer? 2014-07-19 5:23 GMT+08:00 Michael Armbrust mich...@databricks.com:

LabeledPoint with weight

2014-07-21 Thread Jiusheng Chen
It seems MLlib right now doesn't support weighted training; training samples have equal importance. Weighted training can be very useful to reduce data size and speed up training. Do you have a plan to support it in the future? The data format would be something like: label:weight index1:value1

Can't see anything on the storage panel of application UI

2014-07-21 Thread binbinbin915
Hi, I'm running LogisticRegression from MLlib, but I can't see the RDD information in the storage panel. http://apache-spark-user-list.1001560.n3.nabble.com/file/n10296/a.png http://apache-spark-user-list.1001560.n3.nabble.com/file/n10296/b.png

Re: Can't see anything on the storage panel of application UI

2014-07-21 Thread Preeti Khurana
I am getting the same issue. Spark version: 1.0 On 21/07/14 4:16 PM, binbinbin915 binbinbin...@live.cn wrote: Hi, I'm running LogisticRegression from MLlib, but I can't see the RDD information in the storage panel. http://apache-spark-user-list.1001560.n3.nabble.com/file/n10296/a.png

Re: the default GraphX graph-partition strategy on multicore machine?

2014-07-21 Thread Yifan LI
Thanks so much, Ankur :)) Excuse me, but I am wondering (for a chosen partition strategy for my application): 1.1) how to check the size of each partition? Is there any API, or log file? 1.2) how to check the processing cost of each partition (time, memory, etc.)? 2.1) and the global

Re: DynamoDB input source

2014-07-21 Thread Ian Wilkinson
Hi, I am invoking the spark-shell (Spark 1.0.0) with: spark-shell --jars \ libs/aws-java-sdk-1.3.26.jar,\ libs/httpclient-4.1.1.jar,\ libs/httpcore-nio-4.1.jar,\ libs/gson-2.1.jar,\ libs/httpclient-cache-4.1.1.jar,\ libs/httpmime-4.1.1.jar,\ libs/hive-dynamodb-handler-0.11.0.jar,\

DROP IF EXISTS still throws exception about table does not exist?

2014-07-21 Thread Nan Zhu
Hi all, When I try hiveContext.hql("drop table if exists abc") where abc is a non-existent table, I still receive an exception about the non-existent table even though "if exists" is there. The same statement runs well in the Hive shell. Some feedback from the Hive community is here:

Re: NullPointerException When Reading Avro Sequence Files

2014-07-21 Thread Sparky
For those curious, I used the JavaSparkContext and got access to an AvroSequenceFile (a wrapper around SequenceFile) using the following: file = sc.newAPIHadoopFile(hdfs path to my file, AvroSequenceFileInputFormat.class, AvroKey.class, AvroValue.class, new Configuration())
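
For reference, a minimal Scala sketch of that same call, assuming Spark 1.0.x with avro-mapred on the classpath; the HDFS path and the GenericRecord type parameters are assumptions, not something confirmed in the thread:

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.{AvroKey, AvroValue}
    import org.apache.avro.mapreduce.AvroSequenceFileInputFormat
    import org.apache.hadoop.conf.Configuration

    // Read an Avro-wrapped SequenceFile as an RDD of (AvroKey, AvroValue) pairs.
    val records = sc.newAPIHadoopFile(
      "hdfs:///path/to/avro-seq-file",                                                // hypothetical path
      classOf[AvroSequenceFileInputFormat[AvroKey[GenericRecord], AvroValue[GenericRecord]]],
      classOf[AvroKey[GenericRecord]],
      classOf[AvroValue[GenericRecord]],
      new Configuration())

    // Unwrap the Avro datum from each key for downstream processing.
    records.map { case (k, v) => k.datum() }.take(5).foreach(println)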

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-07-21 Thread Abel Coronado Iruegas
Hi Yifan, This works for me: export SPARK_JAVA_OPTS="-Xms10g -Xmx40g -XX:MaxPermSize=10g" export ADD_JARS=/home/abel/spark/MLI/target/MLI-assembly-1.0.jar export SPARK_MEM=40g ./spark-shell Regards On Mon, Jul 21, 2014 at 7:48 AM, Yifan LI iamyifa...@gmail.com wrote: Hi, I am trying to load

Is there anyone who uses streaming join to filter spam as the guide mentioned?

2014-07-21 Thread hawkwang
Hello guys, I'm just trying out Spark Streaming features. I noticed that there is a join example for filtering spam, so I just wanted to try it. But nothing happens after the join; the output JavaPairDStream content is the same as before. So, is there any example that I can refer to? Thanks for any
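
For comparison, a minimal Scala sketch of the pattern from the streaming guide; the stream name, its key/value layout, and the spam list are made up, and the key point is that the DStream and the static RDD must be keyed by the same field for the join to change anything:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // Static reference data: (user, isSpammer)
    val spamInfo: RDD[(String, Boolean)] = sc.parallelize(Seq("mallory" -> true))

    // clicks is assumed to be a DStream[(String, String)] keyed by user.
    def dropSpam(clicks: DStream[(String, String)]): DStream[(String, String)] =
      clicks.transform { rdd =>
        rdd.leftOuterJoin(spamInfo)
           .filter { case (_, (_, flag)) => !flag.getOrElse(false) }   // keep non-spammers
           .mapValues { case (click, _) => click }
      }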

Why spark-submit command hangs?

2014-07-21 Thread Sam Liu
Hi Experts, I set up a Yarn and Spark env: all services run on a single node. I then submitted a WordCount job using the spark-submit script with the command: ./bin/spark-submit tests/wordcount-spark-scala.jar --class scala.spark.WordCount --num-executors 1 --driver-memory 300M --executor-memory 300M

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-07-21 Thread Yifan LI
Thanks, Abel. Best, Yifan LI On Jul 21, 2014, at 4:16 PM, Abel Coronado Iruegas acoronadoirue...@gmail.com wrote: Hi Yifan This works for me: export SPARK_JAVA_OPTS=-Xms10g -Xmx40g -XX:MaxPermSize=10g export ADD_JARS=/home/abel/spark/MLI/target/MLI-assembly-1.0.jar export

Is deferred execution of multiple RDDs ever coming?

2014-07-21 Thread Harry Brundage
Hello fellow Sparkians. In https://groups.google.com/d/msg/spark-users/eXV7Bp3phsc/gVgm-MdeEkwJ, Matei suggested that Spark might get deferred grouping and forced execution of multiple jobs in an efficient way. His code sample: rdd.reduceLater(reduceFunction1) // returns Future[ResultType1]

Re: Spark Streaming with long batch / window duration

2014-07-21 Thread aaronjosephs
So I think I may end up using Hourglass (https://engineering.linkedin.com/datafu/datafus-hourglass-incremental-data-processing-hadoop), a Hadoop framework for incremental data processing. It would be very cool if Spark (not Streaming) could support something like this.

gain access to persisted rdd

2014-07-21 Thread mrm
Hi, I am using PySpark and have persisted a list of RDDs within a function, but I don't have a reference to them anymore. The RDDs are listed in the UI, under the Storage tab, and they have names associated with them (e.g. 4). Is it possible to access the RDDs to unpersist them? Thanks!

Give more Java Heap Memory on Standalone mode

2014-07-21 Thread Nick R. Katsipoulakis
Hello, Currently I work on a project in which I spawn a standalone Apache Spark MLlib job, in standalone mode, from a running Java process. In the code of the Spark job I have the following: SparkConf sparkConf = new SparkConf().setAppName("SparkParallelLoad");

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-21 Thread Yin Huai
Hi Victor, Instead of importing sqlContext.createSchemaRDD, can you explicitly call sqlContext.createSchemaRDD(rdd) to create a SchemaRDD? For example, you have a case class Record: case class Record(data_date: String, mobile: String, create_time: String) Then, you create an RDD[Record] and
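
A minimal sketch of that suggestion for Spark 1.0.x; the input path and field parsing are illustrative:

    import org.apache.spark.sql.SQLContext

    case class Record(data_date: String, mobile: String, create_time: String)

    val sqlContext = new SQLContext(sc)
    val records = sc.textFile("hdfs:///path/to/records.csv")   // hypothetical input
      .map(_.split(","))
      .map(p => Record(p(0), p(1), p(2)))

    // Call createSchemaRDD explicitly instead of relying on the implicit conversion.
    val schemaRDD = sqlContext.createSchemaRDD(records)
    schemaRDD.registerAsTable("records")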

Re: Spark Streaming timing considerations

2014-07-21 Thread Laeeq Ahmed
Hi TD, Thanks for the help. The only problem left is that the dstreamTime contains some extra information which seems to be a date, i.e. 1405944367000 ms, whereas my application timestamps are just in seconds, which I converted to ms, e.g. 2300, 2400, 2500 etc. So the filter doesn't take effect. I

Re: Spark Streaming timing considerations

2014-07-21 Thread Sean Owen
That is just standard Unix time. 1405944367000 = Sun, 09 Aug 46522 05:56:40 GMT On Mon, Jul 21, 2014 at 5:43 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: Hi TD, Thanks for the help. The only problem left here is that the dstreamTime contains some extra information which seems date i.e.

Re: Spark Streaming timing considerations

2014-07-21 Thread Sean Owen
Uh, right. I mean: 1405944367 = Mon, 21 Jul 2014 12:06:07 GMT On Mon, Jul 21, 2014 at 5:47 PM, Sean Owen so...@cloudera.com wrote: That is just standard Unix time. 1405944367000 = Sun, 09 Aug 46522 05:56:40 GMT On Mon, Jul 21, 2014 at 5:43 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: Hi

Re: Give more Java Heap Memory on Standalone mode

2014-07-21 Thread Nick R. Katsipoulakis
Thank you Abel, it seems that your advice worked. Even though I get a message that this is a deprecated way of defining Spark memory (the system warns that I should set spark.driver.memory), the memory is increased. Again, thank you, Nick On Mon, Jul 21, 2014 at 9:42 AM, Abel Coronado

relationship of RDD[Array[String]] to Array[Array[String]]

2014-07-21 Thread Philip Ogren
It is really nice that Spark RDDs provide functions that are often equivalent to functions found in Scala collections. For example, I can call myArray.map(myFx) and, equivalently, myRdd.map(myFx). Awesome! My question is this: is it possible to write code that works on either an RDD or

Re: relationship of RDD[Array[String]] to Array[Array[String]]

2014-07-21 Thread Michael Malak
It's really more of a Scala question than a Spark question, but the standard OO (not Scala-specific) way is to create your own custom supertype (e.g. MyCollectionTrait), inherited/implemented by two concrete classes (e.g. MyRDD and MyArray), each of which manually forwards method calls to the

Re: relationship of RDD[Array[String]] to Array[Array[String]]

2014-07-21 Thread andy petrella
Heya, Without a bit of gymnastics at the type level, nope. Actually RDD doesn't share any functions with the Scala lib (the simple reason I could see is that Spark's are lazy while the default implementations in Scala aren't). However, it'd be possible by implementing an implicit converter
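
A rough sketch of the custom-supertype idea for the handful of shared methods; the names are illustrative and only map is shown:

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    trait MyCollection[A] {
      def map[B: ClassTag](f: A => B): MyCollection[B]
    }

    // Thin wrappers that forward to either an RDD or a local Array.
    class MyRDD[A](val rdd: RDD[A]) extends MyCollection[A] {
      def map[B: ClassTag](f: A => B) = new MyRDD(rdd.map(f))
    }

    class MyArray[A](val arr: Array[A]) extends MyCollection[A] {
      def map[B: ClassTag](f: A => B) = new MyArray(arr.map(f))
    }

An implicit conversion from RDD[A] and Array[A] to MyCollection[A], as Andy suggests, would then keep the call sites unchanged.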

Re: LiveListenerBus throws exception and weird web UI bug

2014-07-21 Thread mrm
I have the same error! Did you manage to fix it?

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-21 Thread Yin Huai
Instead of using union, can you try sqlContext.parquetFile("/user/hive/warehouse/xxx_parquet.db").registerAsTable("parquetTable")? Then, var all = sql("select some_id, some_type, some_time from parquetTable").map(line => (line(0), (line(1).toString, line(2).toString.substring(0, 19)))) Thanks, Yin
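
Spelled out as a sketch against Spark 1.0.x (the column names and path are the ones from the thread; whether a whole .db directory can be read as a single table is exactly what the follow-up below questions):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Register the parquet data once, then query it instead of unioning tables.
    sqlContext.parquetFile("/user/hive/warehouse/xxx_parquet.db").registerAsTable("parquetTable")

    val all = sqlContext.sql("select some_id, some_type, some_time from parquetTable")
      .map(line => (line(0), (line(1).toString, line(2).toString.substring(0, 19))))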

RE: Hive From Spark

2014-07-21 Thread Andrew Lee
Hi All, Currently, if you are running the Spark HiveContext API with Hive 0.12, it won't work due to the following 2 libraries, which are not consistent with Hive 0.12 or Hadoop (Hive libs align with Hadoop libs and, as a common practice, they should be consistent to interoperate).

Re: Hive From Spark

2014-07-21 Thread Sean Owen
I haven't seen anyone actively 'unwilling' -- I hope not. See discussion at https://issues.apache.org/jira/browse/SPARK-2420 where I sketch what a downgrade means. I think it just hasn't gotten a looking over. Contrary to what I thought earlier, the conflict does in fact cause problems in theory,

Re: Why spark-submit command hangs?

2014-07-21 Thread Andrew Or
Hi Sam, Did you specify the MASTER in your spark-env.sh? I ask because I didn't see a --master in your launch command. Also, your app seems to take in a master (yarn-standalone). This is not exactly correct because by the time the SparkContext is launched locally, which is the default, it is too

Re: relationship of RDD[Array[String]] to Array[Array[String]]

2014-07-21 Thread Philip Ogren
Thanks Michael, That is one solution that I had thought of. It seems like a bit of overkill for the few methods I want to do this for - but I will think about it. I guess I was hoping that I was missing something more obvious/easier. Philip On 07/21/2014 11:20 AM, Michael Malak wrote:

Re: LiveListenerBus throws exception and weird web UI bug

2014-07-21 Thread Andrew Or
Hi all, This error happens because we receive a completed event for a particular stage that we don't know about, i.e. a stage we haven't received a submitted event for. The root cause of this, as Baoxu explained, is usually because the event queue is full and the listener begins to drop events.

RDD pipe partitionwise

2014-07-21 Thread Jaonary Rabarisoa
Dear all, Is there any example of mapPartitions that forks an external process, or of how to make RDD.pipe work on every element of a partition? Cheers, Jaonary
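
Two minimal sketches of how this is usually done, assuming myRdd is an existing RDD[String] and the external command is hypothetical: RDD.pipe launches one process per partition and talks to it line-by-line over stdin/stdout, while mapPartitions gives full control over feeding the process.

    import scala.sys.process._
    import org.apache.spark.rdd.RDD

    // 1) RDD.pipe: each partition's elements go to the tool's stdin, one per line;
    //    its stdout lines become the resulting RDD.
    val piped: RDD[String] = myRdd.pipe(Seq("/usr/local/bin/my_tool", "--stdin"))

    // 2) mapPartitions: fork the process yourself (here the whole partition is
    //    buffered into one String, so this only suits modest partition sizes).
    val forked: RDD[String] = myRdd.mapPartitions { iter =>
      val input = new java.io.ByteArrayInputStream(iter.mkString("\n").getBytes("UTF-8"))
      val output = (Seq("/usr/local/bin/my_tool", "--stdin") #< input).!!
      output.split("\n").iterator
    }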

Re: gain access to persisted rdd

2014-07-21 Thread Andrew Or
Hi Maria, If you don't have a reference to a persisted RDD, it will be automatically unpersisted on the next GC by the ContextCleaner. This is implemented for Scala, but should still work in Python because Python uses reference counting to clean up objects that are no longer strongly referenced.

Re: Give more Java Heap Memory on Standalone mode

2014-07-21 Thread Andrew Or
Hi Nick and Abel, Looks like you are requesting 8g for your executors, but only allowing 2g on the workers. You should set SPARK_WORKER_MEMORY to at least 8g if you intend to use that much memory in your application. Also, you shouldn't have to set SPARK_DAEMON_JAVA_OPTS; you can just set
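
A sketch of the two settings involved (values are illustrative): the worker's budget lives in conf/spark-env.sh, while the per-application executor size goes through SparkConf.

    // conf/spark-env.sh on each worker:  SPARK_WORKER_MEMORY=8g

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("SparkParallelLoad")
      .set("spark.executor.memory", "8g")   // must fit inside SPARK_WORKER_MEMORY

    // The driver heap is fixed at JVM launch, so it is better set when submitting
    // (e.g. spark-submit --driver-memory 2g ...) than inside the running program.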

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-21 Thread chutium
Hi, unfortunately it is not so straightforward. xxx_parquet.db is the folder of a managed database created by Hive/Impala, so every sub-element in it is a table in Hive/Impala. They are folders in HDFS, each table has a different schema, and each folder holds one or more parquet files.

Re: Caching issue with msg: RDD block could not be dropped from memory as it does not exist

2014-07-21 Thread Andrew Or
Hi Rindra, Depending on what you're doing with your groupBy, you may end up inflating your data quite a bit. Even if your machine has 16G, by default spark-shell only uses 512M, and the amount used for storing blocks is only 60% of that (spark.storage.memoryFraction), so this space becomes ~300M.

Re: gain access to persisted rdd

2014-07-21 Thread chutium
But at least if users want to access the persisted RDDs, they can use sc.getPersistentRDDs in the same context.
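
A minimal Scala sketch of that (PySpark did not expose getPersistentRDDs at the time); the id 4 is the one mentioned in the question:

    // List what is currently persisted, then drop a specific RDD by its id.
    sc.getPersistentRDDs.foreach { case (id, rdd) =>
      println(s"$id: ${rdd.name} @ ${rdd.getStorageLevel.description}")
    }
    sc.getPersistentRDDs.get(4).foreach(_.unpersist())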

Re: spark streaming rate limiting from kafka

2014-07-21 Thread Bill Jay
Hi Tathagata, I am currently creating multiple DStreams to consume from different topics. How can I let each consumer consume from different partitions? I found the following parameters in the Spark API: createStream[K, V, U : Decoder[_], T : Decoder[_]](jssc: JavaStreamingContext

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-21 Thread Matt Work Coarr
I got this working by having our sysadmin update our security group to allow incoming traffic from the local subnet on ports 1-65535. I'm not sure if there's a more specific range I could have used, but so far, everything is running! Thanks for all the responses Marcelo and Andrew!! Matt

launching a spark cluster in ec2 from within an application

2014-07-21 Thread M@
I would like to programmatically start a Spark cluster in EC2 from another app running in EC2, run my job and then destroy the cluster. I can launch a Spark EMR cluster easily enough using the SDK; however, I ran into two problems: 1) I was only able to retrieve the address of the master node from

Re: LabeledPoint with weight

2014-07-21 Thread Xiangrui Meng
This is a useful feature but it may be hard to have it in v1.1 due to limited time. Hopefully, we can support it in v1.2. -Xiangrui On Mon, Jul 21, 2014 at 12:58 AM, Jiusheng Chen chenjiush...@gmail.com wrote: It seems MLlib right now doesn't support weighted training, training samples have

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-21 Thread Aaron Davidson
What's the exception you're seeing? Is it an OOM? On Mon, Jul 21, 2014 at 11:20 AM, chutium teng@gmail.com wrote: Hi, unfortunately it is not so straightforward xxx_parquet.db is a folder of managed database created by hive/impala, so, every sub element in it is a table in

Re: Spark job tracker.

2014-07-21 Thread abhiguruvayya
Also, I am facing one issue. If I run the program in yarn-cluster mode it works absolutely fine, but if I change it to yarn-client mode I get the error below. Application application_1405471266091_0055 failed 2 times due to AM Container for appattempt_1405471266091_0055_02 exited with

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-21 Thread chutium
no, something like this 14/07/20 00:19:29 ERROR cluster.YarnClientClusterScheduler: Lost executor 2 on 02.xxx: remote Akka client disassociated ... ... 14/07/20 00:21:13 WARN scheduler.TaskSetManager: Lost TID 832 (task 1.2:186) 14/07/20 00:21:13 WARN scheduler.TaskSetManager: Loss was

Does spark streaming fit to our application

2014-07-21 Thread srinivas
Hi, Our application is required to do some aggregations on data that will be coming in as a stream for over two months. I would like to know if Spark Streaming will be suitable for our requirement. After going through some documentation and videos, I think we can do aggregations on data based on

broadcast variable get cleaned by ContextCleaner unexpectedly ?

2014-07-21 Thread Nan Zhu
Hi all, When I run some Spark application (actually a unit test of the application in Jenkins), I find that I always hit a FileNotFoundException when reading a broadcast variable. The program itself works well, except for the unit test. Here is the example log: 14/07/21 19:49:13 INFO

Re: spark sql left join gives KryoException: Buffer overflow

2014-07-21 Thread Michael Armbrust
When SPARK-2211 is done, will spark sql automatically choose join algorithms? Is there some way to manually hint the optimizer? Ideally we will select the best algorithm for you. We are also considering ways to allow the user to hint.

Re: Spark Streaming timing considerations

2014-07-21 Thread Tathagata Das
You will have to use some function that converts between the dstreamTime (ms since epoch, same format as returned by System.currentTimeMillis) and your application-level time. TD On Mon, Jul 21, 2014 at 9:47 AM, Sean Owen so...@cloudera.com wrote: Uh, right. I mean: 1405944367 = Mon, 21 Jul 2014
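
A hedged sketch of what that conversion can look like; the stream, its element layout, and the choice of zero point are assumptions, not taken from the thread:

    import org.apache.spark.streaming.Time

    // Fix a zero point once, when the job starts, and express each batch's
    // wall-clock Time relative to it so it becomes comparable with the data's
    // own second-based timestamps (already converted to ms).
    val zeroMs = System.currentTimeMillis()

    val upToBatchTime = samples.transform { (rdd, batchTime: Time) =>
      val relativeMs = batchTime.milliseconds - zeroMs
      rdd.filter { case (sampleMs: Long, _) => sampleMs <= relativeMs }
    }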

Re: Uber jar with SBT

2014-07-21 Thread Tathagata Das
Just to confirm, are you interested in submitting the Spark job inside the cluster of the Spark standalone mode (that is, one of the workers will be running the driver)? spark-submit does not support that fully yet. You can probably use the instructions present in Spark 0.9.1 to do that.

Re: Is there anyone who uses streaming join to filter spam as the guide mentioned?

2014-07-21 Thread Tathagata Das
Could you share your code snippet so that we can take a look? TD On Mon, Jul 21, 2014 at 7:23 AM, hawkwang wanghao.b...@gmail.com wrote: Hello guys, I'm just trying to use spark streaming features. I noticed that there is join example for filtering spam, so I just want to try. But,

Re: broadcast variable get cleaned by ContextCleaner unexpectedly ?

2014-07-21 Thread Nan Zhu
Hi TD, Thanks for the reply. I tried to reproduce this in a simpler program, but no luck. However, the program has been very simple, just loading some files from HDFS and writing them to HBase… It seems that the issue only appears when I run the unit test in Jenkins (it does not fail every time,

Spark Partitioner vs Cassandra Partitioner

2014-07-21 Thread Marcelo Elias Del Valle
Hi, I am new to Spark, have used Hadoop for some time and just joined the mailing list. I am considering using Spark in my application, reading data from Cassandra in Python and writing mapped data back to Cassandra or to ES after it. The first question I have is: is it possible to use

Re: broadcast variable get cleaned by ContextCleaner unexpectedly ?

2014-07-21 Thread Nan Zhu
Hi TD, I think I got more insight into the problem. In the Jenkins test file, I mistakenly passed a wrong value to spark.cores.max, which is much larger than the expected value (I passed the master address as local[6], and spark.cores.max as 200). If I set a more consistent value, everything goes

Re: Launching with m3.2xlarge instances: /mnt and /mnt2 mounted on 7gb drive

2014-07-21 Thread Matei Zaharia
Actually the script in the master branch is also broken (it's pointing to an older AMI). Try 1.0.1 for launching clusters. On Jul 20, 2014, at 2:25 PM, Chris DuBois chris.dub...@gmail.com wrote: I pulled the latest last night. I'm on commit 4da01e3. On Sun, Jul 20, 2014 at 2:08 PM, Matei

Re: DROP IF EXISTS still throws exception about table does not exist?

2014-07-21 Thread Nan Zhu
Ah, I see, thanks, Yin -- Nan Zhu On Monday, July 21, 2014 at 5:00 PM, Yin Huai wrote: Hi Nan, It is basically a log entry because your table does not exist. It is not a real exception. Thanks, Yin On Mon, Jul 21, 2014 at 7:10 AM, Nan Zhu zhunanmcg...@gmail.com

unable to create rdd with pyspark newAPIHadoopRDD

2014-07-21 Thread umeshdangat
Hello, I have only just started playing around with Spark to see if it fits my needs. I was trying to read some data from Elasticsearch as an RDD so that I could perform some Python-based analytics on it. I am unable to create the RDD object as of now, failing with a serialization error.

Re: broadcast variable get cleaned by ContextCleaner unexpectedly ?

2014-07-21 Thread Tathagata Das
That is definitely weird. spark.cores.max should not affect things when running in local mode. And I am trying to think of scenarios that could cause a broadcast variable used in the current job to fall out of scope, but they all seem very far-fetched. So I am really curious to see the code

RE: error from DecisionTree Training:

2014-07-21 Thread Jack Yang
So this is a bug that is still unsolved (for Java)? From: Jack Yang [mailto:j...@uow.edu.au] Sent: Friday, 18 July 2014 4:52 PM To: user@spark.apache.org Subject: error from DecisionTree Training: Hi All, I got an error while using DecisionTreeModel (my program is written in Java, spark 1.0.1, scala

Re: Launching with m3.2xlarge instances: /mnt and /mnt2 mounted on 7gb drive

2014-07-21 Thread Xiangrui Meng
You can also try a different region. I tested us-west-2 yesterday, and it worked well. -Xiangrui On Sun, Jul 20, 2014 at 4:35 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Actually the script in the master branch is also broken (it's pointing to an older AMI). Try 1.0.1 for launching

Re: error from DecisionTree Training:

2014-07-21 Thread Xiangrui Meng
This is a known issue: https://issues.apache.org/jira/browse/SPARK-2197 . Joseph is working on it. -Xiangrui On Mon, Jul 21, 2014 at 4:20 PM, Jack Yang j...@uow.edu.au wrote: So this is a bug unsolved (for java) yet? From: Jack Yang [mailto:j...@uow.edu.au] Sent: Friday, 18 July 2014 4:52

Re: broadcast variable get cleaned by ContextCleaner unexpectedly ?

2014-07-21 Thread Nan Zhu
Ah, sorry, sorry, my brain just glitched… I sent some wrong information: not “spark.cores.max” but the minPartitions in sc.textFile(). Best, -- Nan Zhu On Monday, July 21, 2014 at 7:17 PM, Tathagata Das wrote: That is definitely weird. spark.cores.max should not affect things when they

Re: How to map each line to (line number, line)?

2014-07-21 Thread Andrew Ash
I'm not sure if you guys ever picked a preferred method for doing this, but I just encountered it and came up with this method that's working reasonably well on a small dataset. It should be quite easily generalizable to non-String RDDs. def addRowNumber(r: RDD[String]): RDD[Tuple2[Long,String]]
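
Since Spark 1.0 the same thing can also be had directly from zipWithIndex (or zipWithUniqueId when gap-free numbering is not required); a minimal sketch:

    import org.apache.spark.rdd.RDD

    // Number each line of the RDD, mirroring the method signature from the thread.
    def addRowNumber(r: RDD[String]): RDD[(Long, String)] =
      r.zipWithIndex().map { case (line, rowNum) => (rowNum, line) }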

RE: error from DecisionTree Training:

2014-07-21 Thread Jack Yang
That is nice. Thanks Xiangrui. -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Tuesday, 22 July 2014 9:31 AM To: user@spark.apache.org Subject: Re: error from DecisionTree Training: This is a known issue: https://issues.apache.org/jira/browse/SPARK-2197 . Joseph is

RE: LiveListenerBus throws exception and weird web UI bug

2014-07-21 Thread 余根茂(木艮)
Hi all, Here is my fix: https://github.com/apache/spark/pull/1356. It is not pretty, but it works well. Any suggestions?

Joining by timestamp.

2014-07-21 Thread durga
Hi, I have a peculiar problem: I have two data sets (large ones). Data set 1: ((timestamp),iterable[Any]) = { (2014-07-10T00:02:45.045+,ArrayBuffer((2014-07-10T00:02:45.045+,98.4859,22))) (2014-07-10T00:07:32.618+,ArrayBuffer((2014-07-10T00:07:32.618+,75.4737,22))) } DataSet2:

Re: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-07-21 Thread hsy...@gmail.com
I have the same problem. On Sat, Jul 19, 2014 at 12:31 AM, lihu lihu...@gmail.com wrote: Hi everyone, I have the following piece of code. When I run it, the error below occurred; it seems that the SparkContext is not serializable, but I do not try to use the SparkContext
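
One common shape of this problem, sketched below with made-up class and field names (it may or may not be what is happening in this particular code): the closure only uses a field, but reading it through `this` drags the enclosing object, and with it the SparkContext, into serialization.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    class Job(val sc: SparkContext) extends Serializable {
      val factor = 10

      // Fails: `factor` is read via `this`, so the whole Job (including sc) is captured.
      def scaledBad(rdd: RDD[Int]): RDD[Int] = rdd.map(_ * factor)

      // Works: copy the needed value into a local first; only the Int is captured.
      def scaledGood(rdd: RDD[Int]): RDD[Int] = {
        val f = factor
        rdd.map(_ * f)
      }
    }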

saveAsSequenceFile for DStream

2014-07-21 Thread Barnaby
First of all, I do not know Scala, but I am learning. I'm doing a proof of concept by streaming content from a socket, counting the words and writing the result to a Tachyon disk. A different script will read the file stream and print out the results. val lines = ssc.socketTextStream(args(0), args(1).toInt,
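
DStream itself only offers saveAsTextFiles / saveAsObjectFiles (and saveAsHadoopFiles for pair streams), so one hedged way to get SequenceFiles is per-batch via foreachRDD; the word-count shape follows the snippet above, and the Tachyon URI is illustrative:

    import org.apache.spark.SparkContext._

    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(w => (w, 1L)).reduceByKey(_ + _)

    // Write each batch of (word, count) pairs as its own SequenceFile.
    wordCounts.foreachRDD { (rdd, time) =>
      rdd.saveAsSequenceFile(s"tachyon://tachyonhost:19998/counts-${time.milliseconds}")
    }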

RE: Joining by timestamp.

2014-07-21 Thread Cheng, Hao
This is a very interesting problem. Spark SQL supports the non-equi join, but it is very inefficient with large tables. One possible solution is to partition both tables with a partition key of (cast(ds as bigint) / 240); then, within each partition of dataset1, you probably can
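
A sketch of that bucketing idea with the plain RDD API, under simplifying assumptions: both datasets are taken as RDD[(Long, String)] keyed by timestamp in ms, the bucket width mirrors the /240 suggestion, and records near a bucket edge would additionally need to be emitted into the neighbouring bucket.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    val bucketMs = 240 * 1000L        // 4-minute buckets, as in the /240 hint
    val toleranceMs = 60 * 1000L      // hypothetical join tolerance

    def byBucket(rdd: RDD[(Long, String)]): RDD[(Long, (Long, String))] =
      rdd.map { case (ts, v) => (ts / bucketMs, (ts, v)) }

    // Join on the coarse bucket, then keep only pairs whose timestamps really match.
    val joined = byBucket(dataSet1).join(byBucket(dataSet2))
      .filter { case (_, ((ts1, _), (ts2, _))) => math.abs(ts1 - ts2) <= toleranceMs }
      .values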

Re: spark streaming rate limiting from kafka

2014-07-21 Thread Tobias Pfeiffer
Bill, numPartitions means the number of Spark partitions that the data received from Kafka will be split into. It has nothing to do with Kafka partitions, as far as I know. If you create multiple Kafka consumers, it doesn't seem like you can specify which consumer will consume which Kafka
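
A hedged sketch of the usual workaround (ZooKeeper quorum, group and topic names are placeholders): start several receivers in the same consumer group and let Kafka's own group management spread the topic's partitions across them, then union the streams.

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 3
    val kafkaStreams = (1 to numReceivers).map { _ =>
      // Each call creates one receiver; Kafka assigns topic partitions among them.
      KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("my-topic" -> 1))
    }
    val unified = ssc.union(kafkaStreams)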

Understanding Spark

2014-07-21 Thread omergul123
Hi, I'm new to the world of big data and I'm trying to understand the use cases of several projects. Of course one of them is Spark. I want to know the proper way of examining my data that resides on my MySQL server. Think that I'm saving every page view of a user with the

defaultMinPartitions in textFile

2014-07-21 Thread Wang, Jensen
Hi, I started to use Spark on Yarn recently and found a problem while tuning my program. When the SparkContext is initialized as sc and ready to read a text file from HDFS, the textFile(path, defaultMinPartitions) method is called. I traced down the second parameter in the Spark source code

RE: data locality

2014-07-21 Thread Haopu Wang
Sandy, I just tried the standalone cluster and haven't had a chance to try Yarn yet. So if I understand correctly, there is *no* special handling of task assignment according to the HDFS block's location when Spark is running as a *standalone* cluster. Please correct me if I'm wrong. Thank

Re: Question about initial message in graphx

2014-07-21 Thread Ankur Dave
On Mon, Jul 21, 2014 at 8:05 PM, Bin WU bw...@connect.ust.hk wrote: I am not sure how to assign different initial values to each node in the graph. Moreover, I am wondering why an initial message is necessary. I think we can instead initialize the graph and then pass it into the Pregel interface?
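
A shortest-paths style sketch of the answer, assuming an existing graph with Double edge weights: per-vertex initial values live in the graph itself (via mapVertices), while initialMsg is only what every vertex receives on the very first superstep, before any sendMsg has run.

    import org.apache.spark.graphx._

    val sourceId: VertexId = 1L
    // Put the per-vertex initial values into the graph.
    val initialized = graph.mapVertices { (id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity
    }

    val shortestPaths = initialized.pregel(Double.PositiveInfinity)(
      (id, dist, newDist) => math.min(dist, newDist),          // vertex program
      triplet =>                                               // send messages
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      (a, b) => math.min(a, b))                                // merge messages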

new error for me

2014-07-21 Thread Nathan Kronenfeld
Does anyone know what this error means: 14/07/21 23:07:22 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks 14/07/21 23:07:22 INFO TaskSetManager: Starting task 3.0:0 as TID 1620 on executor 27: r104u05.oculus.local (PROCESS_LOCAL) 14/07/21 23:07:22 INFO TaskSetManager: Serialized task

Re: defaultMinPartitions in textFile

2014-07-21 Thread Ye Xianjin
Well, I think you missed this line of code in SparkContext.scala, lines 1242-1243 (master): /** Default min number of partitions for Hadoop RDDs when not given by user */ def defaultMinPartitions: Int = math.min(defaultParallelism, 2) So defaultMinPartitions will be 2 unless the
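
So to get more input splits than that default, pass the minimum explicitly; a tiny example (the path is hypothetical):

    // Ask for at least 100 partitions instead of min(defaultParallelism, 2).
    val lines = sc.textFile("hdfs:///path/to/big-input", 100)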

RE: defaultMinPartitions in textFile

2014-07-21 Thread Wang, Jensen
Yes, Great. I thought it was math.max instead of math.min on that line. Thank you! From: Ye Xianjin [mailto:advance...@gmail.com] Sent: Tuesday, July 22, 2014 11:37 AM To: user@spark.apache.org Subject: Re: defaultMinPartitions in textFile well, I think you miss this line of code in

RE: Joining by timestamp.

2014-07-21 Thread durga
Hi Chen, I am new to Spark as well as Spark SQL. Could you please explain how I would create a table and run a query on top of it? That would be super helpful. Thanks, D.

RE: Joining by timestamp.

2014-07-21 Thread Cheng, Hao
Actually it's just a pseudo-algorithm I described; you can do it with the Spark API. Hope the algorithm is helpful. -Original Message- From: durga [mailto:durgak...@gmail.com] Sent: Tuesday, July 22, 2014 11:56 AM To: u...@spark.incubator.apache.org Subject: RE: Joining by timestamp. Hi Chen,

RE: Joining by timestamp.

2014-07-21 Thread durga
Hi Chen, Thank you very much for your reply. I think I do not understand how I can do the join using the Spark API. If you have time, could you please write some code? Thanks again, D.

RE: Joining by timestamp.

2014-07-21 Thread Cheng, Hao
Durga, you can start from the documents: http://spark.apache.org/docs/latest/quick-start.html http://spark.apache.org/docs/latest/programming-guide.html -Original Message- From: durga [mailto:durgak...@gmail.com] Sent: Tuesday, July 22, 2014 12:45 PM