Hello Marcelo Vanzin,
Can you explain a bit more on this? I tried using client mode, but can you
explain how I can use this port to write the log or output to it?
Thanks in advance!
Hi Kevin,
I tried it on Spark 1.0.0 and it works fine.
It's a bug in Spark 1.0.1 ...
Thanks,
Victor
Aha, that makes sense. Thanks for the response! I guess one of the
areas Spark could use some love is in error messages (:
On Fri, Jul 18, 2014 at 9:41 PM, Michael Armbrust
mich...@databricks.com wrote:
Sorry for the non-obvious error message. It is not valid SQL to include
attributes in the
Hi Michael,
Thanks for the suggestion. In my query, both tables are too large to use a
broadcast join.
When SPARK-2211 is done, will Spark SQL automatically choose join
algorithms?
Is there some way to manually hint the optimizer?
2014-07-19 5:23 GMT+08:00 Michael Armbrust mich...@databricks.com:
It seems MLlib doesn't currently support weighted training; training samples
have equal importance. Weighted training can be very useful to reduce data
size and speed up training.
Do you have plans to support it in the future? The data format would be something
like:
label:weight index1:value1
Hi,
I'm running LogisticRegression from MLlib, but I can't see the RDD information
in the Storage panel.
http://apache-spark-user-list.1001560.n3.nabble.com/file/n10296/a.png
http://apache-spark-user-list.1001560.n3.nabble.com/file/n10296/b.png
I am getting the same issue.
Spark version: 1.0
On 21/07/14 4:16 PM, binbinbin915 binbinbin...@live.cn wrote:
Hi,
I'm running LogisticRegression from MLlib, but I can't see the RDD
information
from the Storage panel.
http://apache-spark-user-list.1001560.n3.nabble.com/file/n10296/a.png
Thanks so much, Ankur :))
Excuse me, but I am wondering:
(for a chosen partition strategy for my application)
1.1) How can I check the size of each partition? Is there any API or log file?
1.2) How can I check the processing cost of each partition (time, memory, etc.)?
2.1) and the global
Hi,
I am invoking the spark-shell (Spark 1.0.0) with:
spark-shell --jars \
libs/aws-java-sdk-1.3.26.jar,\
libs/httpclient-4.1.1.jar,\
libs/httpcore-nio-4.1.jar,\
libs/gson-2.1.jar,\
libs/httpclient-cache-4.1.1.jar,\
libs/httpmime-4.1.1.jar,\
libs/hive-dynamodb-handler-0.11.0.jar,\
Hi, all
When I try hiveContext.hql("drop table if exists abc"), where abc is a non-existent
table,
I still receive an exception about the non-existent table even though "if exists" is there.
The same statement runs well in the Hive shell.
Some feedback from the Hive community is here:
For those curious I used the JavaSparkContext and got access to an
AvroSequenceFile (wrapper around Sequence File) using the following:
file = sc.newAPIHadoopFile("hdfs path to my file",
AvroSequenceFileInputFormat.class, AvroKey.class, AvroValue.class,
new Configuration())
Hi Yifan
This works for me:
export SPARK_JAVA_OPTS="-Xms10g -Xmx40g -XX:MaxPermSize=10g"
export ADD_JARS=/home/abel/spark/MLI/target/MLI-assembly-1.0.jar
export SPARK_MEM=40g
./spark-shell
Regards
On Mon, Jul 21, 2014 at 7:48 AM, Yifan LI iamyifa...@gmail.com wrote:
Hi,
I am trying to load
Hello guys,
I'm just trying out Spark Streaming features.
I noticed that there is a join example for filtering spam, so I just wanted
to try it.
But nothing happens after the join; the output JavaPairDStream content is
the same as before.
So, are there any examples that I can refer to?
Thanks for any
Hi Experts,
I set up a YARN and Spark environment: all services run on a single node. I then
submitted a WordCount job using the spark-submit script with the
command: ./bin/spark-submit tests/wordcount-spark-scala.jar --class
scala.spark.WordCount --num-executors 1 --driver-memory 300M --executor-memory
300M
Thanks, Abel.
Best,
Yifan LI
On Jul 21, 2014, at 4:16 PM, Abel Coronado Iruegas acoronadoirue...@gmail.com
wrote:
Hi Yifan
This works for me:
export SPARK_JAVA_OPTS="-Xms10g -Xmx40g -XX:MaxPermSize=10g"
export ADD_JARS=/home/abel/spark/MLI/target/MLI-assembly-1.0.jar
export
Hello fellow Sparkians.
In https://groups.google.com/d/msg/spark-users/eXV7Bp3phsc/gVgm-MdeEkwJ,
Matei suggested that Spark might get deferred grouping and forced execution
of multiple jobs in an efficient way. His code sample:
rdd.reduceLater(reduceFunction1) // returns Future[ResultType1]
So I think I may end up using Hourglass
(https://engineering.linkedin.com/datafu/datafus-hourglass-incremental-data-processing-hadoop),
a Hadoop framework for incremental data processing. It would be very cool if
Spark (not Streaming) could support something like this.
Hi,
I am using PySpark and have persisted a list of RDDs within a function, but
I don't have a reference to them anymore. The RDDs are listed in the UI,
under the Storage tab, and they have names associated with them (e.g. 4). Is
it possible to access these RDDs to unpersist them?
Thanks!
Hello,
Currently I work on a project in which:
I spawn a standalone Apache Spark MLlib job, in standalone mode, from a
running Java process.
In the code of the Spark job I have the following code:
SparkConf sparkConf = new SparkConf().setAppName("SparkParallelLoad");
Hi Victor,
Instead of importing sqlContext.createSchemaRDD, can you explicitly call
sqlContext.createSchemaRDD(rdd) to create a SchemaRDD?
For example,
You have a case class Record.
case class Record(data_date: String, mobile: String, create_time: String)
Then, you create an RDD[Record] and
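A minimal sketch of that explicit call (assuming a running SparkContext sc, e.g. in spark-shell; the input file records.txt and the comma-separated layout are hypothetical):

import org.apache.spark.sql.SQLContext

case class Record(data_date: String, mobile: String, create_time: String)

val sqlContext = new SQLContext(sc)
val rdd = sc.textFile("records.txt")               // hypothetical input
  .map(_.split(","))
  .map(r => Record(r(0), r(1), r(2)))
// call createSchemaRDD explicitly instead of importing the implicit conversion
val schemaRDD = sqlContext.createSchemaRDD(rdd)
schemaRDD.registerAsTable("records")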
Hi TD,
Thanks for the help.
The only problem left here is that the dstreamTime contains some extra
information, which seems to be a date, i.e. 1405944367000 ms, whereas my application
timestamps are just in seconds, which I converted
to ms, e.g. 2300, 2400, 2500, etc. So the filter doesn't take effect.
I
That is just standard Unix time.
1405944367000 = Sun, 09 Aug 46522 05:56:40 GMT
On Mon, Jul 21, 2014 at 5:43 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
Hi TD,
Thanks for the help.
The only problem left here is that the dstreamTime contains some extra
information, which seems to be a date, i.e.
Uh, right. I mean:
1405944367 = Mon, 21 Jul 2014 12:06:07 GMT
On Mon, Jul 21, 2014 at 5:47 PM, Sean Owen so...@cloudera.com wrote:
That is just standard Unix time.
1405944367000 = Sun, 09 Aug 46522 05:56:40 GMT
On Mon, Jul 21, 2014 at 5:43 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
Hi
Thank you Abel,
It seems that your advice worked. Even though I receive a message that it
is a deprecated way of defining Spark Memory (the system prompts that I
should set spark.driver.memory), the memory is increased.
Again, thank you,
Nick
On Mon, Jul 21, 2014 at 9:42 AM, Abel Coronado
It is really nice that Spark RDDs provide functions that are often
equivalent to functions found in Scala collections. For example, I can
call:
myArray.map(myFx)
and equivalently
myRdd.map(myFx)
Awesome!
My question is this: Is it possible to write code that works on either
an RDD or
It's really more of a Scala question than a Spark question, but the standard OO
(not Scala-specific) way is to create your own custom supertype (e.g.
MyCollectionTrait), inherited/implemented by two concrete classes (e.g. MyRDD
and MyArray), each of which manually forwards method calls to the
heya,
Without a bit of gymnastics at the type level, nope. Actually RDD doesn't
share any functions with the Scala collections library (the simple reason I could see is
that Spark's ones are lazy, while the default implementations in Scala
aren't).
However, it'd be possible by implementing an implicit converter
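For what it's worth, a minimal sketch of the wrapper-trait approach mentioned above (MyCollection, MyRDD, and MyArray are made-up names; only map is shown, and other methods would be forwarded the same way):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

trait MyCollection[T] {
  def map[U: ClassTag](f: T => U): MyCollection[U]
}

class MyRDD[T](rdd: RDD[T]) extends MyCollection[T] {
  // forwards to the lazy RDD implementation
  def map[U: ClassTag](f: T => U): MyCollection[U] = new MyRDD(rdd.map(f))
}

class MyArray[T](xs: Seq[T]) extends MyCollection[T] {
  // forwards to the eager Scala collection implementation
  def map[U: ClassTag](f: T => U): MyCollection[U] = new MyArray(xs.map(f))
}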
I have the same error! Did you manage to fix it?
Instead of using union, can you try sqlContext.parquetFile("/user/
hive/warehouse/xxx_parquet.db").registerAsTable("parquetTable")?
Then, var all = sql("select some_id, some_type, some_time from
parquetTable").map(line
=> (line(0), (line(1).toString, line(2).toString.substring(0, 19
Thanks,
Yin
Hi All,
Currently, if you are running the Spark HiveContext API with Hive 0.12, it won't
work due to the following two libraries, which are not consistent with Hive 0.12
or Hadoop. (Hive libs align with Hadoop libs, and as a common
practice they should be consistent to be interoperable.)
I haven't seen anyone actively 'unwilling' -- I hope not. See
discussion at https://issues.apache.org/jira/browse/SPARK-2420 where I
sketch what a downgrade means. I think it just hasn't gotten a proper look
yet.
Contrary to what I thought earlier, the conflict does in fact cause
problems in theory,
Hi Sam,
Did you specify the MASTER in your spark-env.sh? I ask because I didn't see
a --master in your launch command. Also, your app seems to take in a master
(yarn-standalone). This is not exactly correct because by the time the
SparkContext is launched locally, which is the default, it is too
Thanks Michael,
That is one solution that I had thought of. It seems like a bit of
overkill for the few methods I want to do this for - but I will think
about it. I guess I was hoping that I was missing something more
obvious/easier.
Philip
On 07/21/2014 11:20 AM, Michael Malak wrote:
Hi all,
This error happens because we receive a completed event for a particular
stage that we don't know about, i.e. a stage we haven't received a
submitted event for. The root cause of this, as Baoxu explained, is
usually because the event queue is full and the listener begins to drop
events.
Dear all,
Is there any example of mapPartitions that forks an external process, or of how to
make RDD.pipe work on every element of a partition?
Cheers,
Jaonary
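For what it's worth, a minimal sketch of the RDD.pipe route (assuming a running SparkContext sc; "wc -l" is just a stand-in for any external program that reads lines on stdin and writes lines on stdout):

// each partition's elements are written, one per line, to the stdin of one
// process instance; the process's stdout lines become the output partition
val data = sc.parallelize(1 to 100, 4).map(_.toString)
val piped = data.pipe("wc -l")
piped.collect().foreach(println)   // here: one line count per partition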
Hi Maria,
If you don't have a reference to a persisted RDD, it will be automatically
unpersisted on the next GC by the ContextCleaner. This is implemented for
Scala, but should still work in Python, because Python uses reference
counting to clean up objects that are no longer strongly referenced.
Hi Nick and Abel,
Looks like you are requesting 8g for your executors, but only allowing 2g
on the workers. You should set SPARK_WORKER_MEMORY to at least 8g if you
intend to use that much memory in your application. Also, you shouldn't
have to set SPARK_DAEMON_JAVA_OPTS; you can just set
Hi,
unfortunately it is not so straightforward.
xxx_parquet.db
is the folder of a managed database created by Hive/Impala, so every sub-
element in it is a table in Hive/Impala. They are folders in HDFS, each
table has a different schema, and in its folder there are one or more Parquet
files.
Hi Rindra,
Depending on what you're doing with your groupBy, you may end up inflating
your data quite a bit. Even if your machine has 16G, by default spark-shell
only uses 512M, and the amount used for storing blocks is only 60% of that
(spark.storage.memoryFraction), so this space becomes ~300M.
but at least if users want to access the persisted RDDs, they can use
sc.getPersistentRDDs in the same context.
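A minimal sketch of that, using the Scala API (I am not sure this method is exposed in the Python API of that release):

// sc.getPersistentRDDs returns a Map[Int, RDD[_]] of id -> persisted RDD
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println("unpersisting RDD " + id)
  rdd.unpersist()
}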
Hi Tathagata,
I am currently creating multiple DStreams to consume from different topics.
How can I let each consumer consume from different partitions? I found the
following parameters in the Spark API:
createStream[K, V, U <: Decoder[_], T <: Decoder[_]](jssc:
JavaStreamingContext
I got this working by having our sysadmin update our security group to
allow incoming traffic from the local subnet on ports 1-65535. I'm not
sure if there's a more specific range I could have used, but so far,
everything is running!
Thanks for all the responses Marcelo and Andrew!!
Matt
I would like to programmatically start a Spark cluster in EC2 from another
app running in EC2, run my job, and then destroy the cluster. I can launch a
Spark EMR cluster easily enough using the SDK; however, I ran into two
problems:
1) I was only able to retrieve the address of the master node from
This is a useful feature but it may be hard to have it in v1.1 due to
limited time. Hopefully, we can support it in v1.2. -Xiangrui
On Mon, Jul 21, 2014 at 12:58 AM, Jiusheng Chen chenjiush...@gmail.com wrote:
It seems MLlib right now doesn't support weighted training, training samples
have
What's the exception you're seeing? Is it an OOM?
On Mon, Jul 21, 2014 at 11:20 AM, chutium teng@gmail.com wrote:
Hi,
unfortunately it is not so straightforward
xxx_parquet.db
is a folder of managed database created by hive/impala, so, every sub
element in it is a table in
Also, I am facing one issue. If I run the program in yarn-cluster mode it
works absolutely fine, but if I change it to yarn-client mode I get the
error below.
Application application_1405471266091_0055 failed 2 times due to AM
Container for appattempt_1405471266091_0055_02 exited with
no, something like this
14/07/20 00:19:29 ERROR cluster.YarnClientClusterScheduler: Lost executor 2
on 02.xxx: remote Akka client disassociated
...
...
14/07/20 00:21:13 WARN scheduler.TaskSetManager: Lost TID 832 (task 1.2:186)
14/07/20 00:21:13 WARN scheduler.TaskSetManager: Loss was
Hi,
Our application is required to do some aggregations on data that will be
coming as a stream for over two months. I would like to know if Spark
Streaming will be suitable for our requirement. After going through some
documentation and videos, I think we can do aggregations on data based on
Hi, all
When I run a Spark application (actually a unit test of the application in
Jenkins), I found that I always hit a FileNotFoundException when reading a
broadcast variable.
The program itself works well, except in the unit test.
Here is an example log:
14/07/21 19:49:13 INFO
When SPARK-2211 is done, will Spark SQL automatically choose join
algorithms?
Is there some way to manually hint the optimizer?
Ideally we will select the best algorithm for you. We are also considering
ways to allow the user to hint.
You will have to use some function that converts between the dstreamTime (ms since
the epoch, in the same format as returned by System.currentTimeMillis) and your
application-level time.
TD
On Mon, Jul 21, 2014 at 9:47 AM, Sean Owen so...@cloudera.com wrote:
Uh, right. I mean:
1405944367 = Mon, 21 Jul 2014
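A minimal sketch of such a conversion, assuming the application timestamps are seconds counted from some known stream start time (streamStartMs below is hypothetical):

// dstreamTimeMs is milliseconds since the Unix epoch (System.currentTimeMillis format);
// subtract the stream's start time and divide to get application-level seconds
def toAppSeconds(dstreamTimeMs: Long, streamStartMs: Long): Long =
  (dstreamTimeMs - streamStartMs) / 1000L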
Just to confirm, are you interested in submitting the Spark job from inside the
cluster in Spark standalone mode (that is, one of the workers will be
running the driver)? spark-submit does not fully support that yet.
You can probably use the instructions present in Spark 0.9.1 to do that.
Could you share your code snippet so that we can take a look?
TD
On Mon, Jul 21, 2014 at 7:23 AM, hawkwang wanghao.b...@gmail.com wrote:
Hello guys,
I'm just trying to use spark streaming features.
I noticed that there is join example for filtering spam, so I just want to
try.
But,
Hi, TD,
Thanks for the reply.
I tried to reproduce this in a simpler program, but no luck.
However, the program has been very simple: just load some files from HDFS and
write them to HBase…
---
It seems that the issue only appears when I run the unit test in Jenkins (it does not
fail every time,
Hi,
I am new to Spark, have used Hadoop for some time, and just joined the
mailing list.
I am considering using Spark in my application, reading data from Cassandra
in Python and writing mapped data back to Cassandra or to ES afterwards.
The first question I have is: Is it possible to use
Hi, TD,
I think I got more insight into the problem.
In the Jenkins test file, I mistakenly passed a wrong value to spark.cores.max,
which is much larger than the expected value
(I passed the master address as local[6], and spark.cores.max as 200).
If I set a more consistent value, everything goes
Actually the script in the master branch is also broken (it's pointing to an
older AMI). Try 1.0.1 for launching clusters.
On Jul 20, 2014, at 2:25 PM, Chris DuBois chris.dub...@gmail.com wrote:
I pulled the latest last night. I'm on commit 4da01e3.
On Sun, Jul 20, 2014 at 2:08 PM, Matei
Ah, I see,
thanks, Yin
--
Nan Zhu
On Monday, July 21, 2014 at 5:00 PM, Yin Huai wrote:
Hi Nan,
It is basically a log entry because your table does not exist. It is not a
real exception.
Thanks,
Yin
On Mon, Jul 21, 2014 at 7:10 AM, Nan Zhu zhunanmcg...@gmail.com
Hello,
I have only just started playing around with Spark to see if it fits my
needs. I was trying to read some data from Elasticsearch as an RDD, so that
I could perform some Python-based analytics on it. I am unable to create the
RDD object as of now, failing with a serialization error.
That is definitely weird. spark.cores.max should not affect things when they
are running in local mode.
And I am trying to think of scenarios that could cause a broadcast
variable used in the current job to fall out of scope, but they all seem
very far-fetched. So I am really curious to see the code.
So this is a bug that is still unsolved (for Java)?
From: Jack Yang [mailto:j...@uow.edu.au]
Sent: Friday, 18 July 2014 4:52 PM
To: user@spark.apache.org
Subject: error from DecisonTree Training:
Hi All,
I got an error while using DecisionTreeModel (my program is written in Java,
Spark 1.0.1, Scala
You can also try a different region. I tested us-west-2 yesterday, and
it worked well. -Xiangrui
On Sun, Jul 20, 2014 at 4:35 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Actually the script in the master branch is also broken (it's pointing to an
older AMI). Try 1.0.1 for launching
This is a known issue:
https://issues.apache.org/jira/browse/SPARK-2197 . Joseph is working
on it. -Xiangrui
On Mon, Jul 21, 2014 at 4:20 PM, Jack Yang j...@uow.edu.au wrote:
So this is a bug that is still unsolved (for Java)?
From: Jack Yang [mailto:j...@uow.edu.au]
Sent: Friday, 18 July 2014 4:52
Ah, sorry, sorry, my brain is just fried… I sent some wrong information:
not “spark.cores.max” but the minPartitions in sc.textFile().
Best,
--
Nan Zhu
On Monday, July 21, 2014 at 7:17 PM, Tathagata Das wrote:
That is definitely weird. spark.cores.max should not affect things when they
I'm not sure if you guys ever picked a preferred method for doing this, but
I just encountered it and came up with this method that's working
reasonably well on a small dataset. It should be quite easily
generalizable to non-String RDDs.
def addRowNumber(r: RDD[String]): RDD[Tuple2[Long,String]]
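One way to fill in a body for that signature (not necessarily the poster's approach) is RDD.zipWithIndex, available since Spark 1.0:

import org.apache.spark.rdd.RDD

def addRowNumber(r: RDD[String]): RDD[(Long, String)] =
  // zipWithIndex assigns consecutive indices in partition order
  r.zipWithIndex().map { case (line, idx) => (idx, line) }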
That is nice.
Thanks Xiangrui.
-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Tuesday, 22 July 2014 9:31 AM
To: user@spark.apache.org
Subject: Re: error from DecisonTree Training:
This is a known issue:
https://issues.apache.org/jira/browse/SPARK-2197 . Joseph is
Hi all,
Here is my fix: https://github.com/apache/spark/pull/1356. It is not
elegant, but it works well. Any suggestions?
Hi
I have a peculiar problem.
I have two data sets (large ones).
Data set 1:
((timestamp),iterable[Any]) = {
(2014-07-10T00:02:45.045+,ArrayBuffer((2014-07-10T00:02:45.045+,98.4859,22)))
(2014-07-10T00:07:32.618+,ArrayBuffer((2014-07-10T00:07:32.618+,75.4737,22)))
}
DataSet2:
I have the same problem
On Sat, Jul 19, 2014 at 12:31 AM, lihu lihu...@gmail.com wrote:
Hi,
Everyone, I have the following piece of code. When I run it,
an error occurred, just like below; it seems that the SparkContext is not
serializable, but I do not try to use the SparkContext
First of all, I do not know Scala, but I am learning.
I'm doing a proof of concept by streaming content from a socket, counting
the words, and writing them to a Tachyon disk. A different script will read the
file stream and print out the results.
val lines = ssc.socketTextStream(args(0), args(1).toInt,
This is a very interesting problem. Spark SQL supports non-equi joins, but they
are very inefficient with large tables.
One possible solution is to make both tables partition-based, with partition keys
of (cast(ds as bigint) / 240); then, with each partition in dataset1, you
probably can
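A minimal sketch of that bucketing idea with the RDD API (assuming a running SparkContext sc; the data below is made up and already keyed by a timestamp in seconds; the 240-second bucket width comes from the suggestion above, and records near a bucket boundary may need to be emitted into both adjacent buckets):

import org.apache.spark.SparkContext._   // pair-RDD functions such as join (needed pre-1.3)

// hypothetical stand-ins: (timestampSeconds, payload) pairs
val dataset1 = sc.parallelize(Seq((1404957765L, "a"), (1404958052L, "b")))
val dataset2 = sc.parallelize(Seq((1404957800L, "x"), (1404999999L, "y")))

// key each record by its 240-second bucket, then join records in the same bucket
val bucketed1 = dataset1.map { case (ts, v) => (ts / 240, (ts, v)) }
val bucketed2 = dataset2.map { case (ts, v) => (ts / 240, (ts, v)) }
val candidates = bucketed1.join(bucketed2)   // refine by exact timestamps afterwards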
Bill,
numPartitions means the number of Spark partitions that the data received
from Kafka will be split into. It has nothing to do with Kafka partitions, as
far as I know.
If you create multiple Kafka consumers, it doesn't seem like you can
specify which consumer will consume which Kafka
Hi,
I'm new to the big-data world and I'm trying to understand the use
cases of several projects. Of course one of them is Spark.
I want to know: what is the proper way of examining my data that resides
on my MySQL server?
Imagine that I'm saving every page view of a user with the
Hi,
I started to use Spark on YARN recently and found a problem while
tuning my program.
When the SparkContext is initialized as sc and is ready to read a text file from HDFS,
the textFile(path, defaultMinPartitions) method is called.
I traced the second parameter down in the Spark source code
Sandy,
I just tried the standalone cluster and haven't had a chance to try YARN yet.
So if I understand correctly, there is *no* special handling of task
assignment according to the HDFS block's location when Spark is running as a
*standalone* cluster.
Please correct me if I'm wrong. Thank
On Mon, Jul 21, 2014 at 8:05 PM, Bin WU bw...@connect.ust.hk wrote:
I am not sure how to specify different initial values for each node in the
graph. Moreover, I am wondering why the initial message is necessary. I think
we can instead initialize the graph and then pass it into the Pregel interface?
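For what it's worth, a minimal sketch of that idea (the tiny graph and the Double shortest-path-style vertex state below are made up; sc is an existing SparkContext): initialize the per-vertex values with mapVertices, then run Pregel. The initial message is still needed because the vertex program is invoked once with it on the first superstep.

import org.apache.spark.graphx._

// a tiny example graph (hypothetical)
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)
val sourceId = 1L

// give each vertex its own starting value before running Pregel
val initialized = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val result = Pregel(initialized, Double.PositiveInfinity)(
  (id, attr, msg) => math.min(attr, msg),      // vertex program
  triplet =>                                   // send messages
    if (triplet.srcAttr + 1.0 < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + 1.0))
    else Iterator.empty,
  (a, b) => math.min(a, b)                     // merge messages
)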
Does anyone know what this error means:
14/07/21 23:07:22 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
14/07/21 23:07:22 INFO TaskSetManager: Starting task 3.0:0 as TID 1620 on
executor 27: r104u05.oculus.local (PROCESS_LOCAL)
14/07/21 23:07:22 INFO TaskSetManager: Serialized task
well, I think you missed this line of code in SparkContext.scala
line 1242-1243(master):
/** Default min number of partitions for Hadoop RDDs when not given by user */
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
so the defaultMinPartitions will be 2 unless the
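A small illustration of the consequence (the HDFS path is hypothetical): the second argument of textFile is only a lower-bound hint for the number of splits, and when you omit it the hint defaults to at most 2.

println(sc.defaultMinPartitions)                        // math.min(defaultParallelism, 2)
val byDefault  = sc.textFile("hdfs:///some/path")       // uses the default hint above
val morePieces = sc.textFile("hdfs:///some/path", 64)   // ask for at least ~64 partitions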
Yes, great. I thought it was math.max instead of math.min on that line. Thank
you!
From: Ye Xianjin [mailto:advance...@gmail.com]
Sent: Tuesday, July 22, 2014 11:37 AM
To: user@spark.apache.org
Subject: Re: defaultMinPartitions in textFile
well, I think you miss this line of code in
Hi Chen,
I am new to Spark as well as Spark SQL. Could you please explain how
I would create a table and run a query on top of it? That would be super
helpful.
Thanks,
D.
Actually it's just a pseudo-algorithm I described; you can do it with the Spark
API. Hope the algorithm is helpful.
-Original Message-
From: durga [mailto:durgak...@gmail.com]
Sent: Tuesday, July 22, 2014 11:56 AM
To: u...@spark.incubator.apache.org
Subject: RE: Joining by timestamp.
Hi Chen,
Hi Chen,
Thank you very much for your reply. I think I do not understand how I can do
the join using the Spark API. If you have time, could you please write some
code?
Thanks again,
D.
Durga, you can start from the documentation:
http://spark.apache.org/docs/latest/quick-start.html
http://spark.apache.org/docs/latest/programming-guide.html
-Original Message-
From: durga [mailto:durgak...@gmail.com]
Sent: Tuesday, July 22, 2014 12:45 PM
To: