Re: Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I was able to get this working by extending KryoRegistrator and setting the spark.kryo.registrator property. On Thu, Feb 12, 2015 at 12:31 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to register a custom class that extends Kryo's Serializer interface. I can't tell exactly what Class
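
For readers hitting the same issue, a minimal sketch of the pattern Corey describes (MyClass and MyClassSerializer are hypothetical stand-ins for the custom class and its Kryo serializer):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    // Hypothetical registrator wiring a custom Kryo serializer for MyClass
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit =
        kryo.register(classOf[MyClass], new MyClassSerializer)
    }

    // Wired up on the SparkConf:
    //   conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    //   conf.set("spark.kryo.registrator", "com.example.MyRegistrator")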

Re: Master dies after program finishes normally

2015-02-12 Thread Arush Kharbanda
How many nodes do you have in your cluster, how many cores, what is the size of the memory? On Fri, Feb 13, 2015 at 12:42 AM, Manas Kar manasdebashis...@gmail.com wrote: Hi Arush, Mine is a CDH5.3 with Spark 1.2. The only change to my spark programs are -Dspark.driver.maxResultSize=3g

Re: Concurrent batch processing

2015-02-12 Thread Matus Faro
I've been experimenting with my configuration for a couple of days and gained quite a bit of power through small optimizations, but it may very well be something crazy I'm doing that is causing this problem. To give a little bit of background, I am in the early stages of a project that consumes a

Concurrent batch processing

2015-02-12 Thread Matus Faro
Hi, Please correct me if I'm wrong, in Spark Streaming, next batch will not start processing until the previous batch has completed. Is there any way to be able to start processing the next batch if the previous batch is taking longer to process than the batch interval? The problem I am facing

Re: can we insert and update with spark sql

2015-02-12 Thread Debasish Das
I thought more on it...can we provide access to the IndexedRDD through thriftserver API and let the mapPartitions query the API ? I am not sure if ThriftServer is as performant as opening up an API using other akka based frameworks (like play or spray)... Any pointers will be really helpful...

Master dies after program finishes normally

2015-02-12 Thread Manas Kar
Hi, I have a Hidden Markov Model running with 200MB data. Once the program finishes (i.e. all stages/jobs are done) the program hangs for 20 minutes or so before killing master. In the spark master the following log appears. 2015-02-12 13:00:05,035 ERROR akka.actor.ActorSystemImpl: Uncaught

Re: Master dies after program finishes normally

2015-02-12 Thread Arush Kharbanda
What is your cluster configuration? Did you try looking at the Web UI? There are many tips here http://spark.apache.org/docs/1.2.0/tuning.html Did you try these? On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar manasdebashis...@gmail.com wrote: Hi, I have a Hidden Markov Model running with 200MB

correct way to broadcast a variable

2015-02-12 Thread freedafeng
Suppose I have an object to broadcast and then use it in a mapper function, something like the following (Python code): obj2share = sc.broadcast(Some object here) someRdd.map(createMapper(obj2share)).collect() The createMapper function will create a mapper function using the shared object's value. Another
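
For reference, the usual pattern (shown here in Scala; the thread's example is PySpark, but the mechanics match) is to pass the broadcast handle into the closure and call .value on the executors:

    val someRdd = sc.parallelize(Seq("a", "b"))
    val obj2share = sc.broadcast(Map("a" -> 1, "b" -> 2))
    // Reference the handle inside the closure; .value resolves per executor
    someRdd.map(x => obj2share.value.getOrElse(x, 0)).collect()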

Re: Master dies after program finishes normally

2015-02-12 Thread Manas Kar
I have 5 workers, each with 8GB of executor memory. My driver memory is 8GB as well. They are all 8-core machines. To answer Imran's question, my configuration is thus: executor_total_max_heapsize = 18GB. This problem happens at the end of my program. I don't have to run a lot of jobs to see

Re: spark, reading from s3

2015-02-12 Thread Kane Kim
The thing is that my time is perfectly valid... On Tue, Feb 10, 2015 at 10:50 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its with the timezone actually, you can either use an NTP to maintain accurate system clock or you can adjust your system time to match with the AWS one. You can do it

Re: Why can't Spark find the classes in this Jar?

2015-02-12 Thread Deborah Siegel
Hi Abe, I'm new to Spark as well, so someone else could answer better. A few thoughts, which may or may not be the right line of thinking: 1) Spark properties can be set on the SparkConf, and with flags in spark-submit, but settings on SparkConf take precedence. I think your jars flag for

Re: Why can't Spark find the classes in this Jar?

2015-02-12 Thread Sandy Ryza
What version of Java are you using? Core NLP dropped support for Java 7 in its 3.5.0 release. Also, the correct command line option is --jars, not --addJars. On Thu, Feb 12, 2015 at 12:03 PM, Deborah Siegel deborah.sie...@gmail.com wrote: Hi Abe, I'm new to Spark as well, so someone else
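
For example, a spark-submit invocation with extra jars looks roughly like this (paths, jar versions, and class name are illustrative):

    spark-submit --class com.example.NLPJob \
      --jars /path/to/stanford-corenlp-3.4.1.jar,/path/to/stanford-corenlp-3.4.1-models.jar \
      target/myapp.jar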

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-12 Thread fightf...@163.com
Hi, Patrick, Really glad to get your reply. Yes, we are doing group by operations in our work. We know that this is common for growTable when processing large data sets. The question actually is: do we have any chance to modify the initialCapacity ourselves, specifically for our

Re: OutofMemoryError: Java heap space

2015-02-12 Thread Yifan LI
Thanks, Kelvin :) The error seems to disappear after I decreased both spark.storage.memoryFraction and spark.shuffle.memoryFraction to 0.2, and increased driver memory a bit. Best, Yifan LI On 10 Feb 2015, at 18:58, Kelvin Chu 2dot7kel...@gmail.com wrote: Since the stacktrace

Task not serializable problem in the multi-thread SQL query

2015-02-12 Thread lihu
I am trying to use multiple threads to run Spark SQL queries. Some sample code, just like this: val sqlContext = new SQLContext(sc) val rdd_query = sc.parallelize(data, part) rdd_query.registerTempTable("MyTable") sqlContext.cacheTable("MyTable") val serverPool = Executors.newFixedThreadPool(3) val

Re: Spark Streaming distributed batch locking

2015-02-12 Thread Arush Kharbanda
* We have an inbound stream of sensor data for millions of devices (which have unique identifiers). Spark Streaming can handle events in the ballpark of 100-500K records/sec/node, so you need to size the cluster accordingly, and it is scalable. * We need to perform aggregation of this stream

Re: How to log using log4j to local file system inside a Spark application that runs on YARN?

2015-02-12 Thread Emre Sevinc
Marcelo and Burak, Thank you very much for your explanations. Now I'm able to see my logs. On Wed, Feb 11, 2015 at 7:52 PM, Marcelo Vanzin van...@cloudera.com wrote: For Yarn, you need to upload your log4j.properties separately from your app's jar, because of some internal issues that are too

Re: OutOfMemoryError with ramdom forest and small training dataset

2015-02-12 Thread didmar
Ok, I would suggest adding SPARK_DRIVER_MEMORY in spark-env.sh, with a larger amount of memory than the default 512m -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-with-ramdom-forest-and-small-training-dataset-tp21598p21618.html Sent from

Re: spark, reading from s3

2015-02-12 Thread Franc Carter
Check that your timezone is correct as well, an incorrect timezone can make it look like your time is correct when it is skewed. cheers On Fri, Feb 13, 2015 at 5:51 AM, Kane Kim kane.ist...@gmail.com wrote: The thing is that my time is perfectly valid... On Tue, Feb 10, 2015 at 10:50 PM,

Re: How to do broadcast join in SparkSQL

2015-02-12 Thread Dima Zhiyanov
Hello, Has Spark implemented computing statistics for Parquet files? Or is there any other way I can enable broadcast joins between parquet file RDDs in Spark SQL? Thanks Dima -- View this message in context:

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Anders Arpteg
The nm logs only seems to contain similar to the following. Nothing else in the same time range. Any help? 2015-02-12 20:47:31,245 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container

Re: Concurrent batch processing

2015-02-12 Thread Tathagata Das
So you have come across spark.streaming.concurrentJobs already :) Yeah, that is an undocumented feature that does allow multiple output operations to be submitted in parallel. However, this is not made public for the exact reasons that you realized - the semantics in case of stateful operations is
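
For anyone searching the archive, the property is set like any other conf entry (use with the caveats above in mind):

    val conf = new SparkConf()
      .setAppName("my-streaming-app")
      .set("spark.streaming.concurrentJobs", "2")  // undocumented; default is 1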

RE: PySpark 1.2 Hadoop version mismatch

2015-02-12 Thread Michael Nazario
I looked at the environment which I ran the spark-submit command in, and it looks like there is nothing that could be messing with the classpath. Just to be sure, I checked the web UI which says the classpath contains: - The two jars I added: /path/to/avro-mapred-1.7.4-hadoop2.jar and

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
So I tried this: .mapPartitions(itr => { itr.grouped(300).flatMap(items => { myFunction(items) }) }) and I tried this: .mapPartitions(itr => { itr.grouped(300).flatMap(myFunction) }) I tried making myFunction a method, a function val, and even moving it into a singleton

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Sandy Ryza
It seems unlikely to me that it would be a 2.2 issue, though not entirely impossible. Are you able to find any of the container logs? Is the NodeManager launching containers and reporting some exit code? -Sandy On Thu, Feb 12, 2015 at 1:21 PM, Anders Arpteg arp...@spotify.com wrote: No, not

Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Felix C
You would probably write to hdfs or check out https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html You might be able to retrofit it to your use case. --- Original Message --- From: Su She suhsheka...@gmail.com Sent: February 11, 2015 10:55 PM To: Felix C

Re: Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Sean Owen
Not now, but see https://issues.apache.org/jira/browse/SPARK-3066 As an aside, it's quite expensive to make recommendations for all users. IMHO this is not something to do, if you can avoid it architecturally. For example, consider precomputing recommendations only for users whose probability of

Re: Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Crystal Xing
Thanks, Sean! Glad to know it will be in the future release. On Thu, Feb 12, 2015 at 2:45 PM, Sean Owen so...@cloudera.com wrote: Not now, but see https://issues.apache.org/jira/browse/SPARK-3066 As an aside, it's quite expensive to make recommendations for all users. IMHO this is not

Re: Task not serializable problem in the multi-thread SQL query

2015-02-12 Thread Michael Armbrust
It looks to me like perhaps your SparkContext has shut down due to too many failures. I'd look in the logs of your executors for more information. On Thu, Feb 12, 2015 at 2:34 AM, lihu lihu...@gmail.com wrote: I try to use the multi-thread to use the Spark SQL query. some sample code just

Re: Master dies after program finishes normally

2015-02-12 Thread Imran Rashid
The important thing here is the master's memory, that's where you're getting the GC overhead limit. The master is updating its UI to include your finished app when your app finishes, which would cause a spike in memory usage. I wouldn't expect the master to need a ton of memory just to serve the
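
If it is indeed the master daemon running out of heap, the relevant knob for standalone deployments lives in conf/spark-env.sh and is separate from driver/executor memory:

    # Heap for the standalone master/worker daemons themselves (default 512m)
    export SPARK_DAEMON_MEMORY=2g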

Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Crystal Xing
Hi, I wonder if there is a way to do fast top N product recommendations for all users in training using mllib's ALS algorithm. I am currently calling public Rating[] recommendProducts(int user, (see http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/mllib/recommendation/Rating.html)
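
Pending SPARK-3066 (mentioned downthread), the brute-force sketch is to loop over users on the driver; feasible only for modest user counts, and model/ratings are assumed to come from the training step:

    // Collect distinct user ids to the driver, then score one user at a time
    val userIds = ratings.map(_.user).distinct().collect()
    val topN = userIds.map(u => (u, model.recommendProducts(u, 10)))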

Re: Is it possible to expose SchemaRDD’s from thrift server?

2015-02-12 Thread Michael Armbrust
You can start a JDBC server with an existing context. See my answer here: http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html On Thu, Feb 12, 2015 at 7:24 AM, Todd Nist tsind...@gmail.com wrote: I have a question with regards to accessing
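
The gist of that linked answer, sketched against the 1.2-era thrift server entry point:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    // SchemaRDDs registered via registerTempTable on this context become
    // visible to external JDBC/ODBC clients of the thrift server
    HiveThriftServer2.startWithContext(hiveContext)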

Re: spark, reading from s3

2015-02-12 Thread Kane Kim
Looks like my clock is in sync: -bash-4.1$ date curl -v s3.amazonaws.com Thu Feb 12 21:40:18 UTC 2015 * About to connect() to s3.amazonaws.com port 80 (#0) * Trying 54.231.12.24... connected * Connected to s3.amazonaws.com (54.231.12.24) port 80 (#0) GET / HTTP/1.1 User-Agent: curl/7.19.7

Re: How to do broadcast join in SparkSQL

2015-02-12 Thread Michael Armbrust
In Spark 1.3, parquet tables that are created through the datasources API will automatically calculate the sizeInBytes, which is used to broadcast. On Thu, Feb 12, 2015 at 12:46 PM, Dima Zhiyanov dimazhiya...@hotmail.com wrote: Hello Has Spark implemented computing statistics for Parquet
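
The related knob, for reference (the default in this era is 10MB; -1 disables broadcast joins entirely):

    // Broadcast any table whose estimated size is under ~50MB
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
      (50 * 1024 * 1024).toString)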

Re: Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
Hi Sean, I am reading the paper on implicit training, Collaborative Filtering for Implicit Feedback Datasets (http://labs.yahoo.com/files/HuKorenVolinsky-ICDM08.pdf). It mentioned: To this end, let us introduce a set of binary variables p_ui, which indicates the preference of user u to item i. The

Predicting Class Probability with Gradient Boosting/Random Forest

2015-02-12 Thread nilesh
We are using Gradient Boosting/Random Forests, which I have found provide the best results for our recommendations. My issue is that I need the probability of the 0/1 label, and not the predicted label. In the Spark Scala API, I see that the predict method also has an option to provide the
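
One workaround floated for this kind of problem is to average the per-tree votes yourself; a sketch, with field names per the 1.2-era tree ensemble API (verify against your release):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.tree.model.RandomForestModel

    // Approximate P(label = 1) as the fraction of trees voting 1
    def probabilityOfOne(model: RandomForestModel, features: Vector): Double =
      model.trees.map(_.predict(features)).sum / model.numTrees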

RE: spark left outer join with java.lang.UnsupportedOperationException: empty collection

2015-02-12 Thread java8964
OK. I think I have to use None instead of null, then it works. Still switching from Java. I can also just use the field names as I assumed. Great experience. From: java8...@hotmail.com To: user@spark.apache.org Subject: spark left outer join with java.lang.UnsupportedOperationException: empty

Re: exception with json4s render

2015-02-12 Thread Mohnish Kodnani
Any ideas on how to figure out what is going on when using json4s 3.2.11? I need to use 3.2.11, and just to see if things work I downgraded to 3.2.10 and things started working. On Wed, Feb 11, 2015 at 11:45 AM, Charles Feduke charles.fed...@gmail.com wrote: I was having a similar

Re: Task not serializable problem in the multi-thread SQL query

2015-02-12 Thread lihu
Thanks very much, you're right. I called sc.stop() before the thread pool shut down. On Fri, Feb 13, 2015 at 7:04 AM, Michael Armbrust mich...@databricks.com wrote: It looks to me like perhaps your SparkContext has shut down due to too many failures. I'd look in the logs of your

Re: Question about mllib als's implicit training

2015-02-12 Thread Sean Owen
Where there is no user-item interaction, you provide no interaction, not an interaction with strength 0. Otherwise your input is fully dense. On Thu, Feb 12, 2015 at 11:09 PM, Crystal Xing crystalxin...@gmail.com wrote: Hi, I have some implicit rating data, such as the purchasing data. I read

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
The more I'm thinking about this- I may try this instead: val myChunkedRDD: RDD[List[Event]] = inputRDD.mapPartitions(_.grouped(300).toList) I wonder if this would work. I'll try it when I get back to work tomorrow. Yuyhao, I tried your approach too but it seems to be somehow moving all the
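
For the archive, the chunking idea in isolation (Event stands in for the element type; inputRDD: RDD[Event] is assumed). Note the .toList above shouldn't be needed — mapPartitions wants an Iterator back, and grouped already returns one:

    // Chunk each partition's iterator into groups of at most 300 elements
    val myChunkedRDD: RDD[Seq[Event]] = inputRDD.mapPartitions(_.grouped(300))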

RE: Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Ganelin, Ilya
Hi all - I've spent a while playing with this. Two significant sources of speed up that I've achieved are 1) Manually multiplying the feature vectors and caching either the user or product vector 2) By doing so, if one of the RDDs is a global it becomes possible to parallelize this step by
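
A sketch of the manual scoring approach Ilya describes, assuming a trained MatrixFactorizationModel and product factors small enough to broadcast:

    import org.apache.spark.SparkContext._  // pair-RDD implicits (pre-1.3)

    val products = sc.broadcast(model.productFeatures.collect())
    val topN = model.userFeatures.mapValues { uf =>
      products.value
        .map { case (pid, pf) =>
          (pid, uf.zip(pf).map { case (a, b) => a * b }.sum)  // dot product
        }
        .sortBy(-_._2)
        .take(10)
    }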

Re: Question about mllib als's implicit training

2015-02-12 Thread Sean Owen
This all describes how the implementation operates, logically. The matrix P is never formed, for sure, certainly not by the caller. The implementation actually extends to handle negative values in R too but it's all taken care of by the implementation. On Thu, Feb 12, 2015 at 11:29 PM, Crystal

Re: Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
Many thanks! On Thu, Feb 12, 2015 at 3:31 PM, Sean Owen so...@cloudera.com wrote: This all describes how the implementation operates, logically. The matrix P is never formed, for sure, certainly not by the caller. The implementation actually extends to handle negative values in R too but

Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
Hi, I have some implicit rating data, such as purchasing data. I read the paper about the implicit training algorithm used in Spark and it mentioned that for user-product pairs which do not have implicit rating data, such as no purchase, we need to provide the value 0. This is different
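
For concreteness, only the observed interactions are passed to the implicit trainer; the zeros are implied, not materialized (a sketch; data and hyperparameters are illustrative):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Only observed (user, product, strength) triples are provided; absent
    // pairs are implicitly treated as preference 0 with low confidence
    val purchases = sc.parallelize(Seq(Rating(1, 10, 3.0), Rating(2, 11, 1.0)))
    val model = ALS.trainImplicit(purchases, /*rank*/ 10, /*iterations*/ 10,
      /*lambda*/ 0.01, /*alpha*/ 1.0)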

spark left outer join with java.lang.UnsupportedOperationException: empty collection

2015-02-12 Thread java8964
Hi, I am using Spark 1.2.0 with Hadoop 2.2. I have 2 csv files, each with 8 fields. I know that the first field in both files is an ID. I want to find all the IDs that exist in the first file, but NOT in the 2nd file. I came up with the following code in spark-shell. case class origAsLeft
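
A rough sketch of the Option-based pattern the thread lands on (field layout assumed; file1Rdd/file2Rdd stand in for sc.textFile results, and in spark-shell the pair-RDD implicits are already in scope):

    // Key both sides by the first CSV field
    val left  = file1Rdd.map(line => (line.split(",")(0), line))
    val right = file2Rdd.map(line => (line.split(",")(0), line))
    // leftOuterJoin yields (id, (leftVal, Option[rightVal])); keep the misses
    val onlyInFirst = left.leftOuterJoin(right)
      .filter { case (_, (_, rightVal)) => rightVal.isEmpty }  // None, not null
      .keys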

Re: Is it possible to expose SchemaRDD’s from thrift server?

2015-02-12 Thread Todd Nist
Thanks Michael. I will give it a try. On Thu, Feb 12, 2015 at 6:00 PM, Michael Armbrust mich...@databricks.com wrote: You can start a JDBC server with an existing context. See my answer here:

Re: obtain cluster assignment in K-means

2015-02-12 Thread Robin East
KMeans.train actually returns a KMeansModel, so you can use the predict() method of the model, e.g. clusters.predict(pointToPredict) or clusters.predict(pointsToPredict) - the first takes a single Vector, the 2nd an RDD[Vector]. Robin On 12 Feb 2015, at 06:37, Shi Yu shiyu@gmail.com wrote: Hi there, I
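
A compact sketch of that usage (data and parameters illustrative):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
    val clusters = KMeans.train(data, /*k*/ 2, /*maxIterations*/ 20)
    val assignments = clusters.predict(data)             // RDD[Int] of cluster ids
    val one = clusters.predict(Vectors.dense(1.0, 1.0))  // single point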

Re: PySpark 1.2 Hadoop version mismatch

2015-02-12 Thread Sean Owen
No, mr1 should not be the issue here, and I think that would break other things. The OP is not using mr1. client 4 / server 7 means roughly client is Hadoop 1.x, server is Hadoop 2.0.x. Normally, I'd say I think you are packaging Hadoop code in your app by bringing in Spark and its deps. Your app

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-12 Thread Patrick Wendell
The map will start with a capacity of 64, but will grow to accommodate new data. Are you using the groupBy operator in Spark or are you using Spark SQL's group by? This usually happens if you are grouping or aggregating in a way that doesn't sufficiently condense the data created from each input
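
For example, the standard map-side condensation (pairs: RDD[(String, Int)] is assumed):

    // groupByKey buffers every value per key in the map; reduceByKey combines
    // values on the map side before shuffling, keeping the hash map small
    val counts = pairs.reduceByKey(_ + _)
    // rather than: pairs.groupByKey().mapValues(_.sum)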

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Anders Arpteg
Interesting to hear that it works for you. Are you using Yarn 2.2 as well? No strange log messages during startup, and I can't see any other log messages since no executor gets launched. It does not seem to work in yarn-client mode either, failing with the exception below. Exception in thread main

Re: Can't access remote Hive table from spark

2015-02-12 Thread Zhan Zhang
When you log in, you have root access. Then you can do “su hdfs” or any other account. Then you can create hdfs directories and change permissions, etc. Thanks Zhan Zhang On Feb 11, 2015, at 11:28 PM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi Zhan, Yes, I found

Is there a separate mailing list for Spark Developers ?

2015-02-12 Thread Manoj Samel
d...@spark.apache.org (http://apache-spark-developers-list.1001551.n3.nabble.com/), mentioned on http://spark.apache.org/community.html, seems to be bouncing. Is there another one?

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-12 Thread Tathagata Das
Can you give me the whole logs? TD On Tue, Feb 10, 2015 at 10:48 AM, Jon Gregg jonrgr...@gmail.com wrote: OK that worked and getting close here ... the job ran successfully for a bit and I got output for the first couple buckets before getting a java.lang.Exception: Could not compute split,

Re: Streaming scheduling delay

2015-02-12 Thread Tathagata Das
1. Can you try count()? take() often does not force the entire computation. 2. Can you give the full log? From the log it seems that the blocks are added to two nodes but the tasks seem to be launched on different nodes. I don't see any messages removing the blocks, so I need the whole log to debug

Re: Streaming scheduling delay

2015-02-12 Thread Saisai Shao
Hi Tim, I think this code will still introduce a shuffle even when you call repartition on each input stream. Actually this style of implementation will generate more jobs (one job per input stream) than unioning into one stream with DStream.union(), and union normally has no special overhead as
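
The unioned form Saisai describes, roughly (kafkaStreams and numPartitions assumed from the surrounding code):

    // One job per batch instead of one per input stream; a single
    // repartition after the union spreads the data for processing
    val unified = ssc.union(kafkaStreams).repartition(numPartitions)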

Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Kevin (Sangwoo) Kim
Apache Zeppelin also has a scheduler and then you can reload your chart periodically, Check it out: http://zeppelin.incubator.apache.org/docs/tutorial/tutorial.html On Fri Feb 13 2015 at 7:29:00 AM Silvio Fiorito silvio.fior...@granturing.com wrote: One method I’ve used is to publish each

Re: Spark streaming job throwing ClassNotFound exception when recovering from checkpointing

2015-02-12 Thread Tathagata Das
Could you come up with a minimal example through which I can reproduce the problem? On Tue, Feb 10, 2015 at 12:30 PM, conor fennell.co...@gmail.com wrote: I am getting the following error when I kill the spark driver and restart the job: 15/02/10 17:31:05 INFO CheckpointReader: Attempting to

Re: HiveContext in SparkSQL - concurrency issues

2015-02-12 Thread Harika
Hi, I've been reading about Spark SQL and people suggest that using HiveContext is better. So can anyone please suggest a solution to the above problem. This is stopping me from moving forward with HiveContext. Thanks Harika -- View this message in context:

Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
I have a temporal data set in which I'd like to be able to query using Spark SQL. The dataset is actually in Accumulo and I've already written a CatalystScan implementation and RelationProvider[1] to register with the SQLContext so that I can apply my SQL statements. With my current

Re: Using Spark SQL for temporal data

2015-02-12 Thread Michael Armbrust
Hi Corey, I would not recommend using CatalystScan for this. It's lower level, and not stable across releases. You should be able to do what you want with PrunedFilteredScan
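
The rough shape of that API, as a skeleton (Spark 1.2 form, where PrunedFilteredScan is an abstract class extending BaseRelation; in 1.3+ it becomes a trait mixed into BaseRelation, and the body here is left to implement):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql._
    import org.apache.spark.sql.sources.{Filter, PrunedFilteredScan}

    // Hypothetical Accumulo-backed relation
    class EventRelation(val sqlContext: SQLContext, val schema: StructType)
        extends PrunedFilteredScan {
      // Spark passes only the columns it needs plus the WHERE-clause
      // predicates it can express; translate those into Accumulo ranges here
      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = ???
    }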

Re: saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-12 Thread Vladimir Protsenko
Thank you. That worked. 2015-02-12 20:03 GMT+04:00 Imran Rashid iras...@cloudera.com: You need to import the implicit conversions to PairRDDFunctions with import org.apache.spark.SparkContext._ (note that this requirement will go away in 1.3:
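
For readers of the archive, the working combination looks like this (MyObject/MyOutputFormat are the poster's own classes):

    import org.apache.spark.SparkContext._  // PairRDDFunctions implicits (pre-1.3)

    // rddres: RDD[(String, MyObject)];
    // MyOutputFormat extends FileOutputFormat[String, MyObject]
    rddres.saveAsHadoopFile("hdfs://localhost/output",
      classOf[String], classOf[MyObject], classOf[MyOutputFormat])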

Re: HiveContext in SparkSQL - concurrency issues

2015-02-12 Thread Felix C
Your earlier call stack clearly states that it fails because the Derby metastore has already been started by another instance, so I think that is explained by your attempt to run this concurrently. Are you running Spark standalone? Do you have a cluster? You should be able to run spark in

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
Hi Gerard, Great write-up and really good guidance in there. I have to be honest, I don't know why but setting # of partitions for each dStream to a low number (5-10) just causes the app to choke/crash. Setting it to 20 gets the app going but with not so great delays. Bump it up to 30 and I

Re: Extract hour from Timestamp in Spark SQL

2015-02-12 Thread Michael Armbrust
This looks like your executors aren't running a version of Spark with Hive support compiled in. On Feb 12, 2015 7:31 PM, Wush Wu w...@bridgewell.com wrote: Dear Michael, After using the org.apache.spark.sql.hive.HiveContext, the exception: java.util.NoSuchElementException: key not found: hour

Re: Is there a separate mailing list for Spark Developers ?

2015-02-12 Thread Ted Yu
dev@spark is active. e.g. see: http://search-hadoop.com/m/JW1q5zQ1Xw/renaming+SchemaRDD+-%253E+DataFramesubj=renaming+SchemaRDD+gt+DataFrame Cheers On Thu, Feb 12, 2015 at 8:09 PM, Manoj Samel manojsamelt...@gmail.com wrote: d...@spark.apache.org

Re: streaming joining multiple streams

2015-02-12 Thread Tathagata Das
Sorry for the late response. With the amount of data you are planning to join, any system would take time. However, between Hive's MapReduce joins, Spark's basic shuffle, and Spark SQL's join, the latter wins hands down. Furthermore, with the APIs of Spark and Spark Streaming, you will have to do

Re: Streaming scheduling delay

2015-02-12 Thread Tathagata Das
Hey Tim, Let me get the key points. 1. If you are not writing back to Kafka, the delay is stable? That is, instead of foreachRDD { // write to kafka } if you do dstream.count, then the delay is stable. Right? 2. If so, then Kafka is the bottleneck. Is the number of partitions, that you spoke of

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
1) Yes, if I disable writing out to kafka and replace it with some very lightweight action like rdd.take(1), the app is stable. 2) The partitions I spoke of in the previous mail are the number of partitions I create from each dStream. But yes, since I do processing and writing out per partition,

Re: Streaming scheduling delay

2015-02-12 Thread Cody Koeninger
outdata.foreachRDD( rdd => rdd.foreachPartition(rec => { val writer = new KafkaOutputService(otherConf(kafkaProducerTopic).toString, propsMap) writer.output(rec) }) ) So this is creating a new kafka producer for every new
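
A common fix is one producer per executor JVM via a lazily initialized singleton, rather than one per partition per batch. A sketch against the Kafka 0.8 producer API, where the broker address, topic, and outdata: DStream[String] are assumptions:

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    // One producer per executor JVM, created on first use
    object ProducerHolder {
      lazy val producer: Producer[String, String] = {
        val props = new Properties()
        props.put("metadata.broker.list", "broker1:9092")  // assumption
        props.put("serializer.class", "kafka.serializer.StringEncoder")
        new Producer[String, String](new ProducerConfig(props))
      }
    }

    outdata.foreachRDD(rdd => rdd.foreachPartition { partition =>
      partition.foreach(rec =>
        ProducerHolder.producer.send(new KeyedMessage[String, String]("myTopic", rec)))
    })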

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Anders Arpteg
No, not submitting from windows, from a debian distribution. Had a quick look at the rm logs, and it seems some containers are allocated but then released again for some reason. Not easy to make sense of the logs, but here is a snippet from the logs (from a test in our small test cluster) if you'd

Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Silvio Fiorito
One method I’ve used is to publish each batch to a message bus or queue with a custom UI listening on the other end, displaying the results in d3.js or some other app. As far as I’m aware there isn’t a tool that will directly take a DStream. Spark Notebook seems to have some support for

Re: Master dies after program finishes normally

2015-02-12 Thread Akhil Das
Increasing your driver memory might help. Thanks Best Regards On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar manasdebashis...@gmail.com wrote: Hi, I have a Hidden Markov Model running with 200MB data. Once the program finishes (i.e. all stages/jobs are done) the program hangs for 20 minutes

Why are there different parts in my CSV?

2015-02-12 Thread Su She
Hello Everyone, I am writing simple word counts to hdfs using messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class, String.class, (Class) TextOutputFormat.class); 1) However, every 2 seconds I get a new directory that is titled as a csv. So I'll have test.csv, which will be a

An interesting and serious problem I encountered

2015-02-12 Thread Landmark
Hi folks, My Spark cluster has 8 machines, each of which has 377GB physical memory, and thus the total maximum memory that can be used for Spark is more than 2400GB. In my program, I have to deal with 1 billion (key, value) pairs, where the key is an integer and the value is an integer array with

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
I replaced the writeToKafka statements with rdd.count() and sure enough, I have a stable app with total delay well within the batch window (20 seconds). Here are the total delay lines from the driver log: 15/02/13 06:14:26 INFO JobScheduler: Total delay: 6.521 s for time 142380806 ms

How to sum up the values in the columns of a dataset in Scala?

2015-02-12 Thread Carter
I am new to Scala. I have a dataset with many columns, each of which has a column name. Given several column names (these column names are not fixed; they are generated dynamically), I need to sum up the values of these columns. Is there an efficient way of doing this? I worked out a way by using
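
One way to sketch it, assuming each row is parsed into a Map from column name to value (this representation is an assumption, not from the thread):

    // colsToSum arrives at runtime; rows: RDD[Map[String, Double]]
    val colsToSum = Seq("colA", "colB")
    val sums: Map[String, Double] = rows
      .map(row => colsToSum.map(c => c -> row.getOrElse(c, 0.0)).toMap)
      .reduce((a, b) => colsToSum.map(c => c -> (a(c) + b(c))).toMap)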

Re: Streaming scheduling delay

2015-02-12 Thread Saisai Shao
Yes, you can try it. For example, if you have a cluster of 10 executors and 60 Kafka partitions, you can try to choose 10 receivers * 2 consumer threads, so each thread will consume 3 partitions ideally; if you increase the threads to 6, each thread will consume 1 partition ideally. What I think

Re: Why are there different parts in my CSV?

2015-02-12 Thread Akhil Das
For a streaming application, for every batch it will create a new directory and put the data in it. If you don't want multiple part- files inside the directory, you can do a repartition before the saveAs* call.
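
That is, something like the following (a Scala rendering of the thread's Java call; messages: DStream[(String, String)] is assumed):

    import org.apache.hadoop.mapred.TextOutputFormat
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream implicits (pre-1.3)

    // repartition(1) yields a single part- file inside each per-batch directory
    messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/test", "csv",
      classOf[String], classOf[String], classOf[TextOutputFormat[String, String]])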

Re: saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-12 Thread Vladimir Protsenko
Thanks for the reply. I solved the problem by importing org.apache.spark.SparkContext._ per Imran Rashid's suggestion. For the sake of interest, is JavaPairRDD intended for use from Java? What is the purpose of this class? Is my rdd implicitly converted to it in some circumstances? 2015-02-12 19:42

Re: Using Spark SQL for temporal data

2015-02-12 Thread Michael Armbrust
I haven't been paying close attention to the JIRA tickets for PrunedFilteredScan but I noticed some weird behavior around the filters being applied when OR expressions were used in the WHERE clause. From what I was seeing, it looks like it could be possible that the start and end ranges you

Re: 8080 port password protection

2015-02-12 Thread Akhil Das
Just to add to what Arush said, you can go through these links: http://stackoverflow.com/questions/1162375/apache-port-proxy http://serverfault.com/questions/153229/password-protect-and-serve-apache-site-by-port-number Thanks Best Regards On Thu, Feb 12, 2015 at 10:43 PM, Arush Kharbanda

Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Su She
Thanks Kevin for the link, I have had issues trying to install zeppelin as I believe it is not yet supported for CDH 5.3, and Spark 1.2. Please correct me if I am mistaken. On Thu, Feb 12, 2015 at 7:33 PM, Kevin (Sangwoo) Kim kevin...@apache.org wrote: Apache Zeppelin also has a scheduler and

Re: Streaming scheduling delay

2015-02-12 Thread Saisai Shao
Hi Tim, I think maybe you can try this way: create one Receiver per executor and specify a thread count per topic larger than 1; the total number of consumer threads will be: total consumers = (receiver number) * (thread number). Make sure this total is less than or equal to the number of Kafka
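
Concretely, the thread count per topic is the map value passed to createStream, and calling createStream N times yields N receivers (a sketch; the ZooKeeper address, group, and topic are illustrative):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // 2 consumer threads for "myTopic" inside this one receiver;
    // call createStream N times to get N receivers
    val stream = KafkaUtils.createStream(ssc, "zk1:2181", "consumer-group",
      Map("myTopic" -> 2))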

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
Hi Saisai, If I understand correctly, you are suggesting that control parallelism by having number of consumers/executors at least 1:1 for number of kafka partitions. For example, if I have 50 partitions for a kafka topic then either have: - 25 or more executors, 25 receivers, each receiver set

Re: OutOfMemoryError with ramdom forest and small training dataset

2015-02-12 Thread Sean Owen
Looking at the script, I'm not sure whether --driver-memory is supposed to work in standalone client mode. It's too late to set the driver's memory if the driver is what's already running. It specially handles the case where the value is the environment config though. Not sure, this might be on

apply function to all the elements of a rowMatrix

2015-02-12 Thread Donbeo
Hi, I need to apply a function to all the elements of a rowMatrix. How can I do that? There is a more detailed question here: http://stackoverflow.com/questions/28438908/spark-mllib-apply-function-to-all-the-elements-of-a-rowmatrix Thanks a lot! -- View this message in context:
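
RowMatrix has no element-wise map of its own, so the usual route is through its underlying RDD of rows (f here is an arbitrary example function; mat: RowMatrix is assumed):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    def f(x: Double): Double = math.log1p(x)  // example element-wise function
    // Rebuild the matrix from the transformed rows
    val transformed = new RowMatrix(
      mat.rows.map(v => Vectors.dense(v.toArray.map(f))))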

Invoking updateStateByKey twice on the same RDD

2015-02-12 Thread harsha
Can I invoke updateStateByKey twice on the same RDD? My requirement is as follows: 1. Get the event stream from Kafka 2. updateStateByKey to aggregate and filter events based on timestamp 3. Do some processing and save results to Cassandra DB 4. updateStateByKey to remove keys based on logout

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Aniket Bhatnagar
This is tricky to debug. Check the logs of the node and resource manager of YARN to see if you can trace the error. In the past I have had to look closely at the arguments getting passed to the YARN container (they get logged before attempting to launch containers). If I still didn't get a clue, I had to check the

Re: OutOfMemoryError with ramdom forest and small training dataset

2015-02-12 Thread poiuytrez
Very interesting. It works. When I set SPARK_DRIVER_MEMORY=83971m in spark-env.sh or spark-default.conf it works. However, when I set the --driver-memory option with spark submit, the memory is not allocated to the spark master. (the web ui shows the correct value of spark.driver.memory

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-12 Thread Rok Roskar
Hi again, I narrowed down the issue a bit more -- it seems to have to do with the Kryo serializer. When I use it, this results in a Null Pointer: rdd = sc.parallelize(range(10)) d = {} from random import random for i in range(10) : d[i] = random() rdd.map(lambda x: d[x]).collect()

Re: Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi, I believe that partitionBy will use the same (default) partitioner on both RDDs. On 2015-02-12 17:12, Sean Owen wrote: Doesn't this require that both RDDs have the same partitioner? On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid iras...@cloudera.com wrote: Hi Karlson, I think your

Test

2015-02-12 Thread Dima Zhiyanov
Sent from my iPhone

Use of nscala-time within spark-shell

2015-02-12 Thread Hammam
Hi All, Thanks in advance for your help. I have timestamps which I need to convert to datetime using Scala. A folder contains the three needed jar files: joda-convert-1.5.jar joda-time-2.4.jar nscala-time_2.11-1.8.0.jar Using the Scala REPL and adding the jars: scala -classpath *.jar I can use

saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-12 Thread Vladimir Protsenko
Hi. I am stuck with how to save a file to hdfs from spark. I have written MyOutputFormat extends FileOutputFormat[String, MyObject], then in spark calling this: rddres.saveAsHadoopFile[MyOutputFormat]("hdfs://localhost/output") or rddres.saveAsHadoopFile("hdfs://localhost/output", classOf[String],

Is it possible to expose SchemaRDD’s from thrift server?

2015-02-12 Thread Todd Nist
I have a question with regards to accessing SchemaRDD’s and Spark SQL temp tables via the thrift server. It appears that a SchemaRDD when created is only available in the local namespace / context and are unavailable to external services accessing Spark through thrift server via ODBC; is this

Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi All, using Pyspark, I create two RDDs (one with about 2M records (~200MB), the other with about 8M records (~2GB)) of the format (key, value). I've done a partitionBy(num_partitions) on both RDDs and verified that both RDDs have the same number of partitions and that equal keys reside on
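
For the archive, the co-partitioning recipe (shown in Scala; the thread is PySpark, but the mechanics match): both sides must use the same partitioner, and the partitioned RDDs should be cached so the partitioning is preserved at join time. rdd1 and rdd2 are assumed to be pair RDDs:

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(100)
    val a = rdd1.partitionBy(part).cache()  // same partitioner instance
    val b = rdd2.partitionBy(part).cache()
    val joined = a.join(b)  // narrow dependency: no shuffle at join time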

spark mllib error when predict on linear regression model

2015-02-12 Thread Donbeo
Hi, I have a model and I am trying to predict regPoints. Here is the code that I have used. A more detailed question is available at http://stackoverflow.com/questions/28482476/spark-mllib-predict-error-with-map scala> model res26: org.apache.spark.mllib.regression.LinearRegressionModel =
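
For reference, the model exposes both a single-point and a bulk predict (a sketch; testData is assumed to be an RDD[LabeledPoint]):

    // Bulk: RDD[Vector] in, RDD[Double] out
    val predictions = model.predict(testData.map(_.features))
    // Single point
    val yHat: Double = model.predict(testData.first().features)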

Re: 8080 port password protection

2015-02-12 Thread Arush Kharbanda
You could apply password protection using a servlet filter, though this doesn't look like the right group for the question. It can be done for the Spark UI as well. On Thu, Feb 12, 2015 at 10:19 PM, MASTER_ZION (Jairo Linux) master.z...@gmail.com wrote: Hi everyone, I'm creating a development
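
For the Spark UI specifically, there is a hook for installing a standard javax.servlet.Filter; the filter class itself (e.g. one doing basic auth) is yours to supply, and the name below is illustrative:

    # spark-defaults.conf
    spark.ui.filters  com.example.BasicAuthFilter
    # per-filter init parameters go under spark.<filter class name>.params,
    # e.g. spark.com.example.BasicAuthFilter.params  user=admin,pass=secret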
