RE: problem connecting to hdfs on localhost from spark-shell

2014-08-29 Thread linkpatrickliu
Change your conf/spark-env.sh: export HADOOP_CONF_DIR=/etc/hadoop/conf export YARN_CONF_DIR=/etc/hadoop/conf

u'' notation with pyspark output data

2014-08-29 Thread Oleg Ruchovets
Hi, I am working with pyspark and doing a simple aggregation def doSplit(x): y = x.split(',') if(len(y)==3): return y[0],(y[1],y[2]) counts = lines.map(doSplit).groupByKey() output = counts.collect() Iterating over the output I got data in this format: u'1385501280'

Re: RE: The concurrent model of spark job/stage/task

2014-08-29 Thread 35597...@qq.com
hi, dear Thanks for the response. Some comments below. And yes, I am using spark on yarn. 1. The release doc of spark says multiple jobs can be submitted in one application if the jobs (actions) are submitted by different threads. I wrote some java thread code in the driver, one action in each thread,
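
For reference, a minimal Scala sketch of that pattern: two actions fired from separate threads on one SparkContext, so the scheduler can run the two jobs concurrently (the app name and the RDD here are made up for illustration).

    import org.apache.spark.{SparkConf, SparkContext}

    object ConcurrentJobs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("concurrent-jobs"))
        val rdd = sc.parallelize(1 to 1000000, 100).cache()

        // Each thread triggers its own action, hence its own job.
        val t1 = new Thread(new Runnable {
          def run(): Unit = println("sum = " + rdd.map(_.toLong).reduce(_ + _))
        })
        val t2 = new Thread(new Runnable {
          def run(): Unit = println("even count = " + rdd.filter(_ % 2 == 0).count())
        })
        t1.start(); t2.start()
        t1.join(); t2.join()
        sc.stop()
      }
    }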

Re: How to get prerelease thriftserver working?

2014-08-29 Thread Matt Chu
Hey Michael, Cheng, Thanks for the replies. Sadly I can't remember the specific error so I'm going to chalk it up to user error, especially since others on the list have not had a problem. @michael By the way, was at the Spark 1.1 meetup yesterday. Great event, very informative, cheers and keep

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Jaonary Rabarisoa
ok, but what if I have a long list? Do I need to hard code every element of my list like this, or is there a function that translates a list into a tuple? On Fri, Aug 29, 2014 at 3:24 AM, Michael Armbrust mich...@databricks.com wrote: You don't need the Seq, as 'in' is a variadic function.

Re: Failed to run runJob at ReceiverTracker.scala

2014-08-29 Thread Tim Smith
I upped the ulimit to 128k files on all nodes. The job crashed again with DAGScheduler: Failed to run runJob at ReceiverTracker.scala:275. Couldn't get the logs because I killed the job, and it looks like yarn wipes the container logs (not sure why it wipes the logs under /var/log/hadoop-yarn/container).

Re: DStream repartitioning, performance tuning processing

2014-08-29 Thread Tim Smith
I set partitions to 64: // kInMsg.repartition(64) val outdata = kInMsg.map(x => normalizeLog(x._2, configMap)) // Still see all activity only on the two nodes that seem to be receiving from Kafka. On Thu, Aug 28, 2014 at 5:47 PM, Tim Smith secs...@gmail.com wrote: TD - Apologies, didn't realize
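
Note that DStream.repartition returns a new DStream rather than modifying the one it is called on, so the repartitioned stream has to be the one that gets mapped. A minimal sketch, using the names from the post above:

    // repartition() is not in-place; keep and use its result
    val repartitioned = kInMsg.repartition(64)
    val outdata = repartitioned.map(x => normalizeLog(x._2, configMap))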

Re: Spark / Thrift / ODBC connectivity

2014-08-29 Thread Cheng Lian
You can use the Thrift server to access Hive tables that are located in a legacy Hive warehouse and/or those generated by Spark SQL. Simba provides a Spark SQL ODBC driver that enables applications like Tableau. But right now I'm not 100% sure whether the driver has been officially released yet. On

how can I get the number of cores

2014-08-29 Thread Kevin Jung
Hi all Spark web ui gives me the information about total cores and used cores. I want to get this information programmatically. How can I do this? Thanks Kevin

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Michael Armbrust
To pass a list to a variadic function you can use the type ascription :_* For example: val longList = Seq[Expression](a, b, ...) table(src).where('key in (longList: _*)) Also, note that I had to explicitly specify Expression as the type parameter of Seq to ensure that the compiler converts a
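
Spelled out with explicit quoting, the suggestion is roughly as below (a sketch, assuming the HiveContext/SQLContext DSL implicits are in scope so plain values convert to Expressions; "a" and "b" stand for whatever values the list actually holds):

    import org.apache.spark.sql.catalyst.expressions.Expression
    import hiveContext._   // or sqlContext._, for the SchemaRDD DSL

    val longList = Seq[Expression]("a", "b" /* , ... */)
    table("src").where('key in (longList: _*))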

Re: u'' notation with pyspark output data

2014-08-29 Thread Davies Liu
u'14.0' means a unicode string, you can convert into str by u'14.0'.encode('utf8'), or you can convert it into float by float(u'14.0') Davies On Thu, Aug 28, 2014 at 11:22 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi , I am working with pyspark and doing simple aggregation def

Ensuring object in spark streaming runs on specific node

2014-08-29 Thread Filip Andrei
Say you have a spark streaming setup such as JavaReceiverInputDStream... rndLists = jssc.receiverStream(new JavaRandomReceiver(...)); rndLists.map(new NeuralNetMapper(...)) .foreach(new JavaSyncBarrier(...)); Is there any way of ensuring that, say, a JavaRandomReceiver and

Running Spark On Yarn without Spark-Submit

2014-08-29 Thread Archit Thakur
Hi, My requirement is to run Spark on Yarn without using the script spark-submit. I have a servlet and a tomcat server. As and when a request comes, it creates a new SC and keeps it alive for the further requests. I am setting my master in sparkConf as sparkConf.setMaster("yarn-cluster") but the

Re: how to specify columns in groupby

2014-08-29 Thread MEETHU MATHEW
Thank you Yanbo for the reply. I've another query related to cogroup. I want to iterate over the results of the cogroup operation. My code is * grp = RDD1.cogroup(RDD2) * map((lambda (x,y): (x,list(y[0]),list(y[1]))), list(grp)) My result looks like : [((u'764', u'20140826'),

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Jaonary Rabarisoa
Still not working for me. I got a compilation error : *value in is not a member of Symbol.* Any ideas ? On Fri, Aug 29, 2014 at 9:46 AM, Michael Armbrust mich...@databricks.com wrote: To pass a list to a variadic function you can use the type ascription :_* For example: val longList =

Re: Running Spark On Yarn without Spark-Submit

2014-08-29 Thread Archit Thakur
including user@spark.apache.org. On Fri, Aug 29, 2014 at 2:03 PM, Archit Thakur archit279tha...@gmail.com wrote: Hi, My requirement is to run Spark on Yarn without using the script spark-submit. I have a servlet and a tomcat server. As and when request comes, it creates a new SC and

Re: How to debug this error?

2014-08-29 Thread Yanbo Liang
It's not allowed to use an RDD inside a map function. RDDs can only be operated on at the driver of a Spark program. In your case, the group RDD can't be found on the executors. I guess you want to implement a subquery-like operation; try RDD.intersection() or join(). 2014-08-29 12:43 GMT+08:00 Gary Zhao
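
For illustration, a hedged sketch of the join-based alternative (RDD names and sample data are made up):

    import org.apache.spark.{SparkConf, SparkContext}

    object JoinInsteadOfNestedRdd {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("join-instead-of-nested-rdd").setMaster("local[2]"))
        val events = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))  // data to filter
        val group  = sc.parallelize(Seq("a", "c"))                      // keys playing the "subquery" role

        // Wrong: referencing `group` inside events.map/filter closures fails,
        // because RDDs only exist on the driver.
        // Right: turn the key set into a pair RDD and join.
        val filtered = events.join(group.map(k => (k, ()))).mapValues(_._1)
        filtered.collect().foreach(println)   // (a,1) and (c,3)
        sc.stop()
      }
    }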

RE: The concurrent model of spark job/stage/task

2014-08-29 Thread linkpatrickliu
Hi, I think an example will help illustrate the model better. /*** SimpleApp.scala ***/ import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object SimpleApp { def main(args: Array[String]) { val logFile = "$YOUR_SPARK_HOME/README.md" val sc = new SparkContext("local",

Re: Spark Hive max key length is 767 bytes

2014-08-29 Thread arthur.hk.c...@gmail.com
Hi, Tried the same thing in HIVE directly without issue: HIVE: hive> create table test_datatype2 (testbigint bigint); OK Time taken: 0.708 seconds hive> drop table test_datatype2; OK Time taken: 23.272 seconds Then tried again in SPARK: scala> val hiveContext = new

Re: how to filter value in spark

2014-08-29 Thread marylucy
i see it works well, thank you!!! But in the following situation, how to do it: var a = sc.textFile("/sparktest/1/").map((_,"a")) var b = sc.textFile("/sparktest/2/").map((_,"b")) How to get (3,a) and (4,a)? On Aug 28, 2014, at 19:54, Matthew Farrellee m...@redhat.com wrote: On 08/28/2014 07:20 AM, marylucy wrote:
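
If the goal is the pairs from a whose keys do not appear in b (one plausible reading of "(3,a) and (4,a)"), subtractByKey does that directly. A spark-shell sketch, assuming sc is the usual shell context:

    val a = sc.textFile("/sparktest/1/").map((_, "a"))
    val b = sc.textFile("/sparktest/2/").map((_, "b"))

    // keep only the keys of `a` that have no counterpart in `b`
    val onlyInA = a.subtractByKey(b)
    onlyInA.collect().foreach(println)   // e.g. (3,a) and (4,a)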

Re: Running Spark On Yarn without Spark-Submit

2014-08-29 Thread Chester @work
Archit We are using yarn-cluster mode, and calling spark via the Client class directly from the servlet server. It works fine. To establish a communication channel for further requests, it should be possible with yarn-client, but not with yarn-cluster. In yarn-client mode, the spark driver
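
A hedged sketch of the yarn-client variant Chester mentions, where the driver lives inside the servlet JVM and can serve follow-up requests (class name, app name and jar path are placeholders; HADOOP_CONF_DIR/YARN_CONF_DIR must point at the cluster config, as in the first message of this digest):

    import org.apache.spark.{SparkConf, SparkContext}

    object SharedContext {
      // created once, reused across servlet requests
      lazy val sc: SparkContext = {
        val conf = new SparkConf()
          .setMaster("yarn-client")                    // driver stays in this JVM
          .setAppName("servlet-backed-spark")
          .setJars(Seq("/path/to/app-assembly.jar"))   // placeholder: jar with your job classes
        new SparkContext(conf)
      }
    }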

Re: Where to save intermediate results?

2014-08-29 Thread huylv
Hi Daniel, Your suggestion is definitely an interesting approach. In fact, I already have another system to deal with the stream analytical processing part. So basically, the Spark job to aggregate data just accumulatively computes aggregations from historical data together with new batch, which

Re: how can I get the number of cores

2014-08-29 Thread Nicholas Chammas
What version of Spark are you running? Try calling sc.defaultParallelism. I’ve found that it is typically set to the number of worker cores in your cluster. On Fri, Aug 29, 2014 at 3:39 AM, Kevin Jung itsjb.j...@samsung.com wrote: Hi all Spark web ui gives me the information about total
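
A small spark-shell sketch of the idea (hedged: defaultParallelism usually reflects total worker cores on a standalone cluster, and spark.cores.max is only present if a cap was configured):

    val totalCores = sc.defaultParallelism
    val coresMax   = sc.getConf.getOption("spark.cores.max")   // None if not set
    println(s"defaultParallelism = $totalCores, spark.cores.max = $coresMax")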

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-29 Thread Yana Kadiyska
I understand that the DB writes are happening from the workers unless you collect. My confusion is that you believe workers recompute on recovery (node computations which get redone upon recovery). My understanding is that checkpointing dumps the RDD to disk and then cuts the RDD lineage. So I
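
For context, the recovery pattern under discussion looks roughly like this (a sketch, assuming a 1.x release where StreamingContext.getOrCreate is available; the checkpoint directory and socket source are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///user/spark/checkpoints/my-app"   // placeholder path

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("checkpointed-app"), Seconds(5))
      ssc.checkpoint(checkpointDir)
      ssc.socketTextStream("localhost", 9999).count().print()    // placeholder pipeline
      ssc
    }

    // Fresh start: builds a new context. After a crash: reconstructs the context,
    // its DStream graph and pending metadata from the checkpoint directory.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()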

Re: Spark webUI - application details page

2014-08-29 Thread Brad Miller
How did you specify the HDFS path? When I put spark.eventLog.dir hdfs://crosby.research.intel-research.net:54310/tmp/spark-events in my spark-defaults.conf file, I receive the following error: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. :
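
For reference, a hedged spark-defaults.conf fragment of the kind being discussed (host, port and path are placeholders; the directory must exist and be writable by the submitting user before the context starts):

    spark.eventLog.enabled  true
    spark.eventLog.dir      hdfs://namenode-host:54310/tmp/spark-events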

Spark Streaming reset state

2014-08-29 Thread Eko Susilo
Hi all, I would like to ask some advice about resetting spark stateful operation. so i tried like this: JavaStreamingContext jssc = new JavaStreamingContext(context, new Duration(5000)); jssc.remember(Duration(5*60*1000)); jssc.checkpoint(ApplicationConstants.HDFS_STREAM_DIRECTORIES);

Re: Low Level Kafka Consumer for Spark

2014-08-29 Thread bharatvenkat
Chris, I did the Dstream.repartition mentioned in the document on parallelism in receiving, as well as set spark.default.parallelism and it still uses only 2 nodes in my cluster. I notice there is another email thread on the same topic:

/tmp/spark-events permissions problem

2014-08-29 Thread Brad Miller
Hi All, Yesterday I restarted my cluster, which had the effect of clearing /tmp. When I brought Spark back up and ran my first job, /tmp/spark-events was re-created and the job ran fine. I later learned that other users were receiving errors when trying to create a spark context. It turned out

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Michael Armbrust
What version are you using? On Fri, Aug 29, 2014 at 2:22 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Still not working for me. I got a compilation error : *value in is not a member of Symbol.* Any ideas ? On Fri, Aug 29, 2014 at 9:46 AM, Michael Armbrust mich...@databricks.com wrote:

Re: Spark Streaming reset state

2014-08-29 Thread Sean Owen
codes is a DStream, not an RDD. The remember() method controls how long Spark Streaming holds on to the RDDs itself. Clarify what you mean by reset? codes provides a stream of RDDs that contain your computation over a window of time. New RDDs come with the computation over new data. On Fri, Aug

Re: Low Level Kafka Consumer for Spark

2014-08-29 Thread Jonathan Hodges
'this 2-node replication is mainly for failover in case the receiver dies while data is in flight. there's still chance for data loss as there's no write ahead log on the hot path, but this is being addressed.' Can you comment a little on how this will be addressed, will there be a durable WAL?

Re: Spark Streaming reset state

2014-08-29 Thread Eko Susilo
so the codes DStream is currently holding RDDs containing codes and their respective counters. I would like to find a way to reset those RDDs after some period of time. On Fri, Aug 29, 2014 at 5:55 PM, Sean Owen so...@cloudera.com wrote: codes is a DStream, not an RDD. The remember() method controls how long

RE: Q on downloading spark for standalone cluster

2014-08-29 Thread Sagar, Sanjeev
Hello Sparkies! Could anyone please answer this? This is not a Hadoop cluster, so which download option should I use for a standalone cluster? Also, what are the best practices if you have 1TB of data and want to use spark? Do you have to use Hadoop/CDH or some other option?

Announce: Smoke - a web frontend to Spark

2014-08-29 Thread Horacio G. de Oro
Hi everyone! I've been working on Smoke, a web frontend to interactively launch Spark jobs without compiling them (it only supports Scala right now, launching the jobs in yarn-client mode). It works by executing the Scala script with spark-shell on the Spark server. It's developed in Python, uses

Re: Spark Hive max key length is 767 bytes

2014-08-29 Thread Michael Armbrust
Spark SQL is based on Hive 12. They must have changed the maximum key size between 12 and 13. On Fri, Aug 29, 2014 at 4:38 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, Tried the same thing in HIVE directly without issue: HIVE: hive create table test_datatype2

Re: Change delimiter when collecting SchemaRDD

2014-08-29 Thread yadid ayzenberg
Thanks Michael, that makes total sense. It works perfectly. Yadid On Thu, Aug 28, 2014 at 9:19 PM, Michael Armbrust mich...@databricks.com wrote: The comma is just the way the default toString works for Row objects. Since SchemaRDDs are also RDDs, you can do arbitrary transformations on

Problem Accessing Hive Table from hiveContext

2014-08-29 Thread Zitser, Igor
Hi All, New to spark and using Spark 1.0.2 and hive 0.12. If the hive table is created as test_datatypes(testbigint bigint, ss bigint), "select * from test_datatypes" from spark works fine. For create table test_datatypes(testbigint bigint, testdec decimal(5,2)) scala> val

Re: Too many open files

2014-08-29 Thread SK
Hi, I am having the same problem reported by Michael. I am trying to open 30 files. ulimit -n shows the limit is 1024. So I am not sure why the program is failing with Too many open files error. The total size of all the 30 files is 230 GB. I am running the job on a cluster with 10 nodes, each

Possible to make one executor be able to work on multiple tasks simultaneously?

2014-08-29 Thread Victor Tso-Guillen
I'm thinking of local mode where multiple virtual executors occupy the same vm. Can we have the same configuration in spark standalone cluster mode?

Re: Possible to make one executor be able to work on multiple tasks simultaneously?

2014-08-29 Thread Matei Zaharia
Yes, executors run one task per core of your machine by default. You can also manually launch them with more worker threads than you have cores. What cluster manager are you on? Matei On August 29, 2014 at 11:24:33 AM, Victor Tso-Guillen (v...@paxata.com) wrote: I'm thinking of local mode

Re: DStream repartitioning, performance tuning processing

2014-08-29 Thread Tim Smith
I wrote a long post about how I arrived here but in a nutshell I don't see evidence of re-partitioning and workload distribution across the cluster. My new fangled way of starting the job is: run=`date +%m-%d-%YT%T`; \ nohup spark-submit --class logStreamNormalizer \ --master yarn

Re: Possible to make one executor be able to work on multiple tasks simultaneously?

2014-08-29 Thread Victor Tso-Guillen
Standalone. I'd love to tell it that my one executor can simultaneously serve, say, 16 tasks at once for an arbitrary number of distinct jobs. On Fri, Aug 29, 2014 at 11:29 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Yes, executors run one task per core of your machine by default. You can
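
One hedged way to get that effect on standalone with this era's configuration (the numbers are examples): over-advertise cores on each worker so its single executor per application accepts that many concurrent tasks, and cap each application with spark.cores.max if several jobs share the cluster.

    # conf/spark-env.sh on each worker: advertise 16 "cores" regardless of hardware
    export SPARK_WORKER_CORES=16

    # spark-defaults.conf (or SparkConf) per application: total cores the app may claim
    spark.cores.max 16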

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Jaonary Rabarisoa
1.0.2 On Friday, August 29, 2014, Michael Armbrust mich...@databricks.com wrote: What version are you using? On Fri, Aug 29, 2014 at 2:22 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Still not working for me. I got a compilation error

Re: DStream repartitioning, performance tuning processing

2014-08-29 Thread Tim Smith
Crash again. On the driver, logs say: 14/08/29 19:04:55 INFO BlockManagerMaster: Removed 7 successfully in removeExecutor org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0:0 failed 4 times, most recent failure: TID 6383 on host node-dn1-2-acme.com failed for unknown

Spark Streaming with Kafka, building project with 'sbt assembly' is extremely slow

2014-08-29 Thread Aris
Hi folks, I am trying to use Kafka with Spark Streaming, and it appears I cannot do the normal 'sbt package' as I do with other Spark applications, such as Spark alone or Spark with MLlib. I learned I have to build with the sbt-assembly plugin. OK, so here is my build.sbt file for my extremely
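
For comparison, a minimal build.sbt of the kind this setup needs (a sketch; versions are examples from the 1.0.x line). Marking spark-core and spark-streaming as "provided" keeps them out of the fat jar, so only the Kafka connector and its transitive dependencies get bundled, which also tends to make 'sbt assembly' much faster:

    name := "kafka-streaming-app"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.0.2" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.0.2" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.0.2"   // not on the cluster, so bundle it
    )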

Re: SparkSql is slow over yarn

2014-08-29 Thread Nishkam Ravi
Can you share more details about your job, cluster properties and configuration parameters? Thanks, Nishkam On Fri, Aug 29, 2014 at 11:33 AM, Chirag Aggarwal chirag.aggar...@guavus.com wrote: When I run SparkSql over yarn, it runs 2-4 times slower as compared to when its run in local mode.

Re: Anyone know how to submit spark job to yarn in java code?

2014-08-29 Thread Archit Thakur
Hi, I am facing the same problem. Did you find any solution or workaround? Thanks and Regards, Archit Thakur. On Thu, Jan 16, 2014 at 6:22 AM, Liu, Raymond raymond@intel.com wrote: Hi Regarding your question 1) when I run the above script, which jar is being submitted to the yarn

[PySpark] large # of partitions causes OOM

2014-08-29 Thread Nick Chammas
Here’s a repro for PySpark: a = sc.parallelize(["Nick", "John", "Bob"]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) When I try this on an EC2 cluster with 1.1.0-rc2 and Python 2.7, this is what I get: a = sc.parallelize(["Nick", "John", "Bob"]) a =

Re: Low Level Kafka Consumer for Spark

2014-08-29 Thread Tim Smith
Good to see I am not the only one who cannot get incoming Dstreams to repartition. I tried repartition(512) but still no luck - the app stubbornly runs only on two nodes. Now this is 1.0.0 but looking at release notes for 1.0.1 and 1.0.2, I don't see anything that says this was an issue and has

Re: [Spark Streaming] kafka consumer announce

2014-08-29 Thread Evgeniy Shishkin
TD, can you please comment on this code? I am really interested in including this code in Spark. But I am bothered by some points about persistence: 1. When we extend Receiver and call store, is it a blocking call? Does it return only when spark stores the rdd as requested (i.e. replicated or

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Michael Armbrust
This feature was not part of that version. It will be in 1.1. On Fri, Aug 29, 2014 at 12:33 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: 1.0.2 On Friday, August 29, 2014, Michael Armbrust mich...@databricks.com wrote: What version are you using? On Fri, Aug 29, 2014 at 2:22 AM,

What is the better data structure in an RDD

2014-08-29 Thread cjwang
I need some advice regarding how data are stored in an RDD. I have millions of records, called Measures. They are bucketed with keys of String type. I wonder if I need to store them as RDD[(String, Measure)] or RDD[(String, Iterable[Measure])], and why? Data in each bucket are not related
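
A hedged spark-shell illustration of the trade-off, with Measure stood in by a tiny case class: the flat pair form lets aggregations stay distributed (reduceByKey combines per partition before shuffling), while the grouped form is what groupByKey produces and materializes each whole bucket as one in-memory Iterable on a single task.

    import org.apache.spark.rdd.RDD

    case class Measure(value: Double)

    // Flat form: RDD[(String, Measure)]
    val flat: RDD[(String, Measure)] = sc.parallelize(Seq(
      ("bucketA", Measure(1.0)), ("bucketA", Measure(2.0)), ("bucketB", Measure(3.0))))
    val sums = flat.mapValues(_.value).reduceByKey(_ + _)   // distributed partial aggregation

    // Grouped form: RDD[(String, Iterable[Measure])] -- each bucket shuffled to one place
    val grouped: RDD[(String, Iterable[Measure])] = flat.groupByKey()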

Re: Low Level Kafka Consumer for Spark

2014-08-29 Thread Tim Smith
I create my DStream very simply as: val kInMsg = KafkaUtils.createStream(ssc, "zkhost1:2181/zk_kafka", "testApp", Map("rawunstruct" -> 8)) . . eventually, before I operate on the DStream, I repartition it: kInMsg.repartition(512) Are you saying that ^^ repartition doesn't split my dstream into multiple

Re: Spark Streaming reset state

2014-08-29 Thread Christophe Sebastien
You can use a tuple associating a timestamp to your running sum; and have COMPUTE_RUNNING_SUM to reset the running sum to zero when the timestamp is more than 5 minutes old. You'll still have a leak doing so if your keys keep changing, though. --Christophe 2014-08-29 9:00 GMT-07:00 Eko Susilo
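
In Scala for brevity, a hedged sketch of that idea with updateStateByKey, assuming checkpointing is already enabled and events stands in for the incoming DStream[String] of codes: the state per key is (running sum, time of last reset), and the sum restarts once that timestamp is older than the window. Christophe's caveat about keys that keep changing still applies.

    val resetIntervalMs = 5 * 60 * 1000L

    def updateWithReset(newCounts: Seq[Long], state: Option[(Long, Long)]): Option[(Long, Long)] = {
      val now = System.currentTimeMillis()
      val (sum, since) = state.getOrElse((0L, now))
      val (baseSum, baseSince) =
        if (now - since > resetIntervalMs) (0L, now)   // state too old: start over
        else (sum, since)
      Some((baseSum + newCounts.sum, baseSince))
    }

    // events: DStream[String] -- placeholder for however the codes arrive
    // result: DStream[(String, (Long, Long))] = (code, (running sum, last reset time))
    val runningSums = events.map(c => (c, 1L)).updateStateByKey(updateWithReset _)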

Re: Possible to make one executor be able to work on multiple tasks simultaneously?

2014-08-29 Thread Victor Tso-Guillen
Any more thoughts on this? I'm not sure how to do this yet. On Fri, Aug 29, 2014 at 12:10 PM, Victor Tso-Guillen v...@paxata.com wrote: Standalone. I'd love to tell it that my one executor can simultaneously serve, say, 16 tasks at once for an arbitrary number of distinct jobs. On Fri, Aug

Re: Low Level Kafka Consumer for Spark

2014-08-29 Thread Tim Smith
Ok, so I did this: val kInStreams = (1 to 10).map { _ => KafkaUtils.createStream(ssc, "zkhost1:2181/zk_kafka", "testApp", Map("rawunstruct" -> 1)) } val kInMsg = ssc.union(kInStreams) val outdata = kInMsg.map(x => normalizeLog(x._2, configMap)) This has improved parallelism. Earlier I would only get a Stream 0.

RE: [Streaming] Akka-based receiver with messages defined in uploaded jar

2014-08-29 Thread Anton Brazhnyk
Just checked it with 1.0.2 Still same exception. From: Anton Brazhnyk [mailto:anton.brazh...@genesys.com] Sent: Wednesday, August 27, 2014 6:46 PM To: Tathagata Das Cc: user@spark.apache.org Subject: RE: [Streaming] Akka-based receiver with messages defined in uploaded jar Sorry for the delay

Re: [Streaming] Akka-based receiver with messages defined in uploaded jar

2014-08-29 Thread Tathagata Das
Can you try adding the JAR to the class path of the executors directly, by setting the config spark.executor.extraClassPath in the SparkConf. See Configuration page - http://spark.apache.org/docs/latest/configuration.html#runtime-environment I think what you guessed is correct. The Akka actor
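
Concretely, a hedged equivalent of that suggestion in code (the jar path is a placeholder and must already exist at that location on every executor node, or be shipped some other way):

    val conf = new SparkConf()
      .setAppName("akka-receiver-app")
      .set("spark.executor.extraClassPath", "/opt/myapp/lib/messages.jar")   // placeholder path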

Re: Too many open files

2014-08-29 Thread Ye Xianjin
Oops, the last reply didn't go to the user list. Mail app's fault. Shuffling happens in the cluster, so you need to change all the nodes in the cluster. Sent from my iPhone On Aug 30, 2014, at 3:10, Sudha Krishna skrishna...@gmail.com wrote: Hi, Thanks for your response. Do you know if I