Re: spark - reading hdfs files every 5 minutes

2014-08-19 Thread Akhil Das
Spark Streaming https://spark.apache.org/docs/latest/streaming-programming-guide.html is the best fit for this use case. Basically you create a streaming context pointing to that directory, and you can also set the streaming interval (in your case it's 5 minutes). SparkStreaming will only process the
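
A minimal sketch of that setup in Scala (app name and path are illustrative, not from the thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val conf = new SparkConf().setAppName("hdfs-every-5-minutes")
    val ssc = new StreamingContext(conf, Minutes(5))
    // Each 5-minute batch picks up only files newly added to the directory.
    val lines = ssc.textFileStream("hdfs:///data/incoming")
    lines.foreachRDD(rdd => println(s"records in this batch: ${rdd.count()}"))
    ssc.start()
    ssc.awaitTermination()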

Problem in running a job on more than one workers

2014-08-19 Thread Rasika Pohankar
Hello, I am trying to run a job on two workers. I have a cluster of 3 computers where one is the master and the other two are workers. I am able to successfully register the separate physical machines as workers in the cluster. When I run a job with a single worker connected, it runs

Naive Bayes

2014-08-19 Thread Phuoc Do
I'm trying the Naive Bayes classifier for the Higgs Boson challenge on Kaggle: http://www.kaggle.com/c/higgs-boson Here's the source code I'm working on: https://github.com/dnprock/SparkHiggBoson/blob/master/src/main/scala/KaggleHiggBosonLabel.scala Training data looks like this:

Performance problem on collect

2014-08-19 Thread Emmanuel Castanier
Hi all, I'm a total newbie with Spark, so my question may be a dumb one. I tried Spark to compute values; on this side all works perfectly (and it's fast :) ). At the end of the process, I have an RDD with Key(String)/Values(Array of String), from which I want to get only one entry like this :

Re: Naive Bayes

2014-08-19 Thread Xiangrui Meng
What is the ratio of examples labeled `s` to those labeled `b`? Also, Naive Bayes doesn't work on negative feature values. It assumes term frequencies as the input. We should throw an exception on negative feature values. -Xiangrui On Tue, Aug 19, 2014 at 12:07 AM, Phuoc Do phu...@vida.io wrote:

Re: Performance problem on collect

2014-08-19 Thread Sean Owen
You can use the function lookup() to accomplish this too; it may be a bit faster. It will never be efficient like a database lookup since this is implemented by scanning through all of the data. There is no index or anything. On Tue, Aug 19, 2014 at 8:43 AM, Emmanuel Castanier
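
A small sketch of the lookup() call (sample data is made up; assumes the spark-shell, so sc and the pair-RDD implicits are in scope):

    val pairs = sc.parallelize(Seq(("a", Array("1", "2")), ("b", Array("3"))))
    // lookup() returns every value stored under the key; with no index, it still scans the data.
    val values: Seq[Array[String]] = pairs.lookup("a")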

Re: Problem in running a job on more than one workers

2014-08-19 Thread Rasika Pohankar
Hello, On the web UI of the master even though there are two workers shown, there is only one executor. There is an executor for machine1 but no executor for machine2. Hence if only machine1 is added as a worker the program runs but if only machine2 is added, it fails with the same error 'Master

Re: Performance problem on collect

2014-08-19 Thread Emmanuel Castanier
Thanks for your answer. In my case that's sad, because we have only 60 entries in the final RDD; I was thinking it would be fast to get the needed one. On 19 Aug 2014, at 09:58, Sean Owen so...@cloudera.com wrote: You can use the function lookup() to accomplish this too; it may be a bit

Re: Processing multiple files in parallel

2014-08-19 Thread Sean Owen
sc.textFile already returns just one RDD for all of your files. The sc.union is unnecessary, although I don't know if it's adding any overhead. The data is certainly processed in parallel and how it is parallelized depends on where the data is -- how many InputSplits Hadoop produces for them. If

Re: Performance problem on collect

2014-08-19 Thread Sean Owen
In that case, why not collectAsMap() and have the whole result as a simple Map in memory? then lookups are trivial. RDDs aren't distributed maps. On Tue, Aug 19, 2014 at 9:17 AM, Emmanuel Castanier emmanuel.castan...@gmail.com wrote: Thanks for your answer. In my case, that’s sad cause we have
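
A sketch of the collectAsMap() suggestion (sample data is illustrative; reasonable when the result has only ~60 entries; assumes the spark-shell):

    val result = sc.parallelize(Seq(("key1", Array("a", "b")), ("key2", Array("c"))))
    // Pull the whole (small) result to the driver as a plain Scala Map.
    val asMap = result.collectAsMap()
    asMap.get("key1")   // Option[Array[String]]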

RE: Cannot run program Rscript using SparkR

2014-08-19 Thread Stuti Awasthi
Thanks Shivaram, This was the issue. Now I have installed Rscript on all the nodes in the Spark cluster and it works now both from the script as well as the R prompt. Thanks Stuti Awasthi From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu] Sent: Tuesday, August 19, 2014 1:17 PM To: Stuti

Re: type issue: found RDD[T] expected RDD[A]

2014-08-19 Thread Patrick McGloin
Hi Amit, I think the type of the data contained in your RDD needs to be a known case class and not abstract for createSchemaRDD. This makes sense when you think it needs to know about the fields in the object to create the schema. I had the same issue when I used an abstract base class for a
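
A sketch of the pattern under the Spark 1.0-era SQL API being discussed (class and table names are illustrative; assumes the spark-shell):

    case class Person(name: String, age: Int)    // a concrete case class, not an abstract type

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD            // implicit conversion from an RDD of case classes

    val people = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
    people.registerAsTable("people")             // the schema is derived from Person's fields
    sqlContext.sql("SELECT name FROM people WHERE age > 26").collect()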

Partitioning under Spark 1.0.x

2014-08-19 Thread losmi83
Hi guys, I want to create two RDD[(K, V)] objects and then collocate partitions with the same K on one node. When the same partitioner for two RDDs is used, partitions with the same K end up being on different nodes. Here is a small example that illustrates this: // Let's say I have 10
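
For reference, a sketch of sharing one partitioner instance across both RDDs (data is made up; assumes the spark-shell). Identical partitioning lines the partitions up, but physical node placement also depends on locality:

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(10)
    val left  = sc.parallelize(0 until 100).map(k => (k, "left")).partitionBy(partitioner)
    val right = sc.parallelize(0 until 100).map(k => (k, "right")).partitionBy(partitioner)
    // Joining two RDDs that share the same partitioner avoids re-shuffling either of them.
    val joined = left.join(right)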

spark error when distinct on more than one column

2014-08-19 Thread wan...@testbird.com
sql:SELECT app_id,COUNT(DISTINCT app_id, macaddr) cut from object group by app_id  Error Log 14/08/19 17:58:26 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 158.6 KB, free 294.7 MB) Exception in thread main java.lang.RuntimeException: [1.36] failure:

Executor Memory, Task hangs

2014-08-19 Thread Laird, Benjamin
Hi all, I'm doing some testing on a small dataset (HadoopRDD, 2GB, ~10M records) with a cluster of 3 nodes. Simple calculations like count take approximately 5s when using the default value of executor.memory (512MB). When I scale this up to 2GB, several tasks take 1m or more (while most

Re: Executor Memory, Task hangs

2014-08-19 Thread Akhil Das
Looks like 1 worker is doing the job. Can you repartition the RDD? Also, what is the number of cores that you allocated? You can easily identify things like this by looking at the workers' web UI (default worker:8081) Thanks Best Regards On Tue, Aug 19, 2014 at 6:35 PM, Laird, Benjamin

Re: Executor Memory, Task hangs

2014-08-19 Thread Sean Owen
Given a fixed amount of memory allocated to your workers, more memory per executor means fewer executors can execute in parallel. This means it takes longer to finish all of the tasks. Set high enough, and your executors can find no worker with enough memory and so they all are stuck waiting for

Re: Executor Memory, Task hangs

2014-08-19 Thread Laird, Benjamin
Thanks Akhil and Sean. All three workers are doing the work and tasks stall simultaneously on all three. I think Sean hit on my issue. I've been under the impression that each application has one executor process per worker machine (not per core per machine). Is that incorrect? If an executor

Re: Bug or feature? Overwrite broadcasted variables.

2014-08-19 Thread Peng Cheng
Unfortunately, after some research I found it's just a side effect of how a closure containing a var works in Scala: http://stackoverflow.com/questions/11657676/how-does-scala-maintains-the-values-of-variable-when-the-closure-was-defined The closure keeps referring to the broadcasted var wrapper as a pointer,
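
A sketch of the re-broadcast pattern this behavior implies (a driver-side var holding the Broadcast wrapper; data is illustrative, assumes the spark-shell):

    // Driver-side var pointing at the current broadcast value.
    var lookupBc = sc.broadcast(Map("a" -> 1))
    val before = sc.parallelize(1 to 3).map(x => x + lookupBc.value("a")).collect()

    // "Overwriting" means pointing the var at a freshly broadcast value;
    // closures created after this line see the new broadcast.
    lookupBc = sc.broadcast(Map("a" -> 100))
    val after = sc.parallelize(1 to 3).map(x => x + lookupBc.value("a")).collect()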

How to configure SPARK_EXECUTOR_URI to access files from maprfs

2014-08-19 Thread Lee Strawther (lstrawth)
/CLUSTER1/MAIN/tmp/spark-1.0.1-bin-mapr3.tgz' to '/tmp/mesos/slaves/20140815-101817-3334820618-5050-32618-0/frameworks/20140819-101702-3334820618-5050-16778-0001/executors/20140815-101817-3334820618-5050-32618-0/runs/e56fffbe-942d-4b15-a798-a00401387927' cp: cannot stat `/usr/local/mesos-0.18.1

Re: Viewing web UI after fact

2014-08-19 Thread Grzegorz Białek
Hi, Is there any way to view the history of application statistics in the master UI after restarting the master server? I have all logs in /tmp/spark-events/ but when I start the history server in this directory it says No Completed Applications Found. Maybe I could copy these logs to the dir used by the master server but

Re: Writing to RabbitMQ

2014-08-19 Thread jschindler
Thanks for the quick and clear response! I now have a better understanding of what is going on regarding the driver and worker nodes which will help me greatly. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Writing-to-RabbitMQ-tp11283p12386.html Sent

spark on disk executions

2014-08-19 Thread Oleg Ruchovets
Hi, We have ~1TB of data to process, but our cluster doesn't have sufficient memory for such a data set (we have a 5-10 machine cluster). Is it possible to process 1TB of data using ON DISK options with Spark? If yes, where can I read about the configuration for ON DISK execution? Thanks
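
One relevant knob, sketched below (the path is a placeholder): persist the dataset with a disk-only storage level so it does not have to fit in memory; shuffle data spills to disk as well.

    import org.apache.spark.storage.StorageLevel

    val big = sc.textFile("hdfs:///data/one-terabyte-input")
    big.persist(StorageLevel.DISK_ONLY)   // cache partitions to local disk instead of memory
    println(big.count())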

Updating shared data structure between executors

2014-08-19 Thread Tim Smith
Hi, I am writing some Scala code to normalize a stream of logs using an input configuration file (multiple regex patterns). To avoid re-starting the job, I can read in a new config file using fileStream and then turn the config file to a map. But I am unsure about how to update a shared map

Python script runs fine in local mode, errors in other modes

2014-08-19 Thread Aaron
Hello, I have a relatively simple Python program that works just fine in local mode (--master local) but produces a strange error when I try to run it via YARN (--deploy-mode client --master yarn) or just execute the code through pyspark. Here's the code: sc = SparkContext(appName="foo") input =

Re: Python script runs fine in local mode, errors in other modes

2014-08-19 Thread Davies Liu
Could you post the complete stacktrace? On Tue, Aug 19, 2014 at 10:47 AM, Aaron aaron.doss...@target.com wrote: Hello, I have a relatively simple Python program that works just fine in local mode (--master local) but produces a strange error when I try to run it via YARN ( --deploy-mode

Re: Failed jobs show up as succeeded in YARN?

2014-08-19 Thread Arun Ahuja
We see this all the time as well; I don't believe there is much of a relationship between the Spark job status and what YARN shows as the status. On Mon, Aug 11, 2014 at 3:17 PM, Shay Rojansky r...@roji.org wrote: Spark 1.0.2, Python, Cloudera 5.1 (Hadoop 2.3.0) It seems that Python jobs

Re: Python script runs fine in local mode, errors in other modes

2014-08-19 Thread Aaron
Sure thing, this is the stacktrace from pyspark. It's repeated a few times, but I think this is the unique stuff. Traceback (most recent call last): File stdin, line 1, in module File /opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/python/pyspark/rdd.py, line 583, in collect

Re: a noob question for how to implement setup and cleanup in Spark map

2014-08-19 Thread Yana Kadiyska
Sean, would this work -- rdd.mapPartitions { partition => Iterator(partition) }.foreach( // Some setup code here // save partition to DB // Some cleanup code here ) I tried a pretty simple example ... I can see that the setup and cleanup are executed on the executor node, once per

EC2 instances missing SSD drives randomly?

2014-08-19 Thread Andras Barjak
Hi, Using the spark 1.0.1 ec2 script I launched 35 m3.2xlarge instances. (I was using the Singapore region.) Some of the instances came without the ephemeral internal (non-EBS) SSD devices that are supposed to be attached to them. Some of them have these drives but not all, and there is no sign

Re: a noob question for how to implement setup and cleanup in Spark map

2014-08-19 Thread Sean Owen
I think you're looking for foreachPartition(). You've kinda hacked it out of mapPartitions(). Your case has a simple solution, yes. After saving to the DB, you know you can close the connection, since you know the use of the connection has definitely just finished. But it's not a simpler solution
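
A sketch of the foreachPartition() shape being described; the connection class is a stand-in so the example compiles, not a real DB client:

    // Stand-in for a real database connection.
    class Conn { def insert(record: Int): Unit = println(record); def close(): Unit = () }
    def createConnection(): Conn = new Conn

    sc.parallelize(1 to 100).foreachPartition { partition =>
      val conn = createConnection()        // setup: once per partition
      try {
        partition.foreach(conn.insert)     // save every record in the partition
      } finally {
        conn.close()                       // cleanup: runs once the partition is done
      }
    }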

Re: Python script runs fine in local mode, errors in other modes

2014-08-19 Thread Aaron
These three lines of python code cause the error for me: sc = SparkContext(appName="foo") input = sc.textFile("hdfs://[valid hdfs path]") mappedToLines = input.map(lambda myline: myline.split(",")) The file I'm loading is a simple CSV. -- View this message in context:

noob: how to extract different members of a VertexRDD

2014-08-19 Thread spr
I'm a Scala / Spark / GraphX newbie, so may be missing something obvious. I have a set of edges that I read into a graph. For an iterative community-detection algorithm, I want to assign each vertex to a community with the name of the vertex. Intuitively it seems like I should be able to pull

Re: Python script runs fine in local mode, errors in other modes

2014-08-19 Thread Davies Liu
This script runs fine without your CSV file. Could you download your CSV file to local disk and narrow down which lines trigger this issue? On Tue, Aug 19, 2014 at 12:02 PM, Aaron aaron.doss...@target.com wrote: These three lines of python code cause the error for me: sc =

Re: spark - reading hdfs files every 5 minutes

2014-08-19 Thread salemi
Thank you but how do you convert the stream to parquet file? Ali -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Re-spark-reading-hfds-files-every-5-minutes-tp12359p12401.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: noob: how to extract different members of a VertexRDD

2014-08-19 Thread Ankur Dave
(+user) On Tue, Aug 19, 2014 at 12:05 PM, spr s...@yarcdata.com wrote: I want to assign each vertex to a community with the name of the vertex. As I understand it, you want to set the vertex attributes of a graph to the corresponding vertex ids. You can do this using Graph#mapVertices [1] as

Re: EC2 instances missing SSD drives randomly?

2014-08-19 Thread Jey Kottalam
I think you have to explicitly list the ephemeral disks in the device map when launching the EC2 instance. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html On Tue, Aug 19, 2014 at 11:54 AM, Andras Barjak andras.bar...@lynxanalytics.com wrote: Hi, Using the

Re: noob: how to extract different members of a VertexRDD

2014-08-19 Thread spr
ankurdave wrote: val g = ... val newG = g.mapVertices((id, attr) => id) // newG.vertices has type VertexRDD[VertexId], or RDD[(VertexId, VertexId)] Yes, that worked perfectly. Thanks much. One follow-up question. If I just wanted to get those values into a vanilla variable (not a

pyspark/yarn and inconsistent number of executors

2014-08-19 Thread Calvin
I've set up a YARN (Hadoop 2.4.1) cluster with Spark 1.0.1 and I've been seeing some inconsistencies with out of memory errors (java.lang.OutOfMemoryError: unable to create new native thread) when increasing the number of executors for a simple job (wordcount). The general format of my submission

RDD Grouping

2014-08-19 Thread TJ Klein
Hi, is there a way to group items in an RDD together so that I can process them using parallelize/map? Let's say I have data items with keys 1...1000, e.g. loading RDD = sc.newAPIHadoopFile(...).cache() Now, I would like them to be processed in chunks of e.g. tens

Re: RDD Grouping

2014-08-19 Thread Sean Owen
groupBy seems to be exactly what you want. val data = sc.parallelize(1 to 200) data.groupBy(_ % 10).values.map(...) This would let you process 10 Iterable[Int] in parallel, each of which is 20 ints in this example. It may not make sense to do this in practice, as you'd be shuffling a lot of

Re: noob: how to extract different members of a VertexRDD

2014-08-19 Thread Ankur Dave
At 2014-08-19 12:47:16 -0700, spr s...@yarcdata.com wrote: One follow-up question. If I just wanted to get those values into a vanilla variable (not a VertexRDD or Graph or ...) so I could easily look at them in the REPL, what would I do? Are the aggregate data structures inside the
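
For the follow-up question, a sketch building on newG from the earlier reply: collect() pulls the vertex pairs back to the driver as a plain local array for inspection in the REPL.

    val localVertices: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.VertexId)] =
      newG.vertices.collect()
    localVertices.take(5).foreach(println)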

Re: Naive Bayes

2014-08-19 Thread Phuoc Do
Hi Xiangrui, Training data: 42945 s out of 124659. Test data: 42722 s out of 125341. The ratio is very much the same. I tried Decision Tree. It outputs 0 to 1 decimals. I don't quite understand it yet. Would feature scaling make it work for Naive Bayes? Phuoc Do On Tue, Aug 19, 2014 at 12:51

Only master is really busy at KMeans training

2014-08-19 Thread durin
When trying to use KMeans.train with some large data and 5 worker nodes, it would fail due to BlockManagers shutting down because of a timeout. I was able to prevent that by adding spark.storage.blockManagerSlaveTimeoutMs 300 to the spark-defaults.conf. However, with 1 Million feature vectors,

saveAsTextFile hangs with hdfs

2014-08-19 Thread David
I have a simple spark job that seems to hang when saving to hdfs. When looking at the spark web ui, the job reached 97 of 100 tasks completed. I need some help determining why the job appears to hang. The job hangs on the saveAsTextFile() call.

Re: OpenCV + Spark : Where to put System.loadLibrary ?

2014-08-19 Thread kmatzen
Reviving this thread hoping I might be able to get an exact snippet for the correct way to do this in Scala. I had a solution for OpenCV that I thought was correct, but half the time the library was not loaded by the time it was needed. Keep in mind that I am completely new at Scala, so you're going

Re: Spark RuntimeException due to Unsupported datatype NullType

2014-08-19 Thread Yin Huai
Hi Rafeeq, I think the following part triggered the bug https://issues.apache.org/jira/browse/SPARK-2908. [{*href:null*,rel:me}] It has been fixed. Can you try spark master and see if the error gets resolved? Thanks, Yin On Mon, Aug 11, 2014 at 3:53 AM, rafeeq s rafeeq.ec...@gmail.com wrote:

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-19 Thread Yin Huai
Seems https://issues.apache.org/jira/browse/SPARK-2846 is the jira tracking this issue. On Mon, Aug 18, 2014 at 6:26 PM, cesararevalo ce...@zephyrhealthinc.com wrote: Thanks, Zhan for the follow up. But, do you know how I am supposed to set that table name on the jobConf? I don't have

Re: spark error when distinct on more than one column

2014-08-19 Thread Yin Huai
Hi, The SQLParser used by SQLContext is pretty limited. Instead, can you try HiveContext? Thanks, Yin On Tue, Aug 19, 2014 at 7:57 AM, wan...@testbird.com wan...@testbird.com wrote: sql:SELECT app_id,COUNT(DISTINCT app_id, macaddr) cut from object group by app_id *Error Log* 14/08/19

spark streaming - how to prevent that empty dstream to get written out to hdfs

2014-08-19 Thread salemi
Hi All, I have the following code, and if the dstream is empty, spark streaming writes empty files to hdfs. How can I prevent it? val ssc = new StreamingContext(sparkConf, Minutes(1)) val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap)
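
One way to skip empty batches, sketched against the dStream above (the output path is a placeholder): wrap the save in foreachRDD and only write when the batch actually contains data.

    dStream.foreachRDD { (rdd, time) =>
      // take(1) is a cheap emptiness check that avoids a full count.
      if (rdd.take(1).nonEmpty) {
        rdd.saveAsTextFile(s"hdfs:///output/batch-${time.milliseconds}")
      }
    }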

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-19 Thread Cesar Arevalo
Thanks! Yeah, it may be related to that. I'll check out that pull request that was sent and hopefully that fixes the issue. I'll let you know, after fighting with this issue yesterday I had decided to just leave it on the side and return to it after, so it may take me a while to get back to you.

Re: saveAsTextFile hangs with hdfs

2014-08-19 Thread evadnoob
update: hangs even when not writing to hdfs. I changed the code to avoid saveAsTextFile() and instead do a foreachPartition and log the results. This time it hangs at 96/100 tasks. I changed the saveAsTextFile to: stringIntegerJavaPairRDD.foreachPartition(p -> {

Re: saveAsTextFile hangs with hdfs

2014-08-19 Thread evadnoob
Not sure if this is helpful or not, but in one executor stderr log, I found this: 14/08/19 20:17:04 INFO CacheManager: Partition rdd_5_14 not found, computing it 14/08/19 20:17:04 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/08/19

spark-submit with Yarn

2014-08-19 Thread Arun Ahuja
Is there more documentation on using spark-submit with Yarn? Trying to launch a simple job does not seem to work. My run command is as follows: /opt/cloudera/parcels/CDH/bin/spark-submit \ --master yarn \ --deploy-mode client \ --executor-memory 10g \ --driver-memory 10g \

Re: spark-submit with Yarn

2014-08-19 Thread Marcelo Vanzin
On Tue, Aug 19, 2014 at 2:34 PM, Arun Ahuja aahuj...@gmail.com wrote: /opt/cloudera/parcels/CDH/bin/spark-submit \ --master yarn \ --deploy-mode client \ This should be enough. But when I view the job 4040 page, SparkUI, there is a single executor (just the driver node) and I see

Re: spark-submit with Yarn

2014-08-19 Thread Arun Ahuja
Yes, the application is overwriting it - I need to pass it as argument to the application otherwise it will be set as local. Thanks for the quick reply! Also, yes now the appTrackingUrl is set properly as well, before it just said unassigned. Thanks! Arun On Tue, Aug 19, 2014 at 5:47 PM,

Re: RDD Grouping

2014-08-19 Thread TJ Klein
Thanks a lot. Yes, mapPartitions seems a better way of dealing with this problem, since with groupBy() I need to collect() data before applying parallelize(), which is expensive. -- View this message in context:
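
A sketch of that mapPartitions() route (chunk count and processing are illustrative): repartition to the desired number of chunks and handle each chunk as an iterator, with no collect()/parallelize() round trip through the driver.

    val data = sc.parallelize(1 to 200)
    val perChunkResults = data.repartition(10).mapPartitions { chunk =>
      val items = chunk.toList       // one chunk of roughly 20 items
      Iterator(items.sum)            // replace with the real per-chunk processing
    }
    perChunkResults.collect()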

Re: sqlContext.parquetFile(path) fails if path is a file but succeeds if a directory

2014-08-19 Thread chutium
It is definitely a bug; sqlContext.parquetFile should take both a dir and a single file as a parameter. This if-check for isDir makes no sense after this commit https://github.com/apache/spark/pull/1370/files#r14967550 I opened a ticket for this issue https://issues.apache.org/jira/browse/SPARK-3138

Re: spark-submit with Yarn

2014-08-19 Thread Andrew Or
The --master should override any other ways of setting the Spark master. Ah yes, actually you can set spark.master directly in your application through SparkConf. Thanks Marcelo. 2014-08-19 14:47 GMT-07:00 Marcelo Vanzin van...@cloudera.com: On Tue, Aug 19, 2014 at 2:34 PM, Arun Ahuja

Re: Processing multiple files in parallel

2014-08-19 Thread SK
Without the sc.union, my program crashes with the following error: Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED at

Re: Task's Scheduler Delay in web ui

2014-08-19 Thread Chris Fregly
Scheduling Delay is the time required to assign a task to an available resource. If you're seeing large scheduler delays, this likely means that other jobs/tasks are using up all of the resources. Here's some more info on how to set up Fair Scheduling versus the default FIFO Scheduler:
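
A minimal sketch of switching on fair scheduling (the pool name is a placeholder):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("fair-scheduling-example")
      .set("spark.scheduler.mode", "FAIR")                        // default is FIFO
    val sc = new org.apache.spark.SparkContext(conf)
    sc.setLocalProperty("spark.scheduler.pool", "interactive")    // jobs from this thread go to this pool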

First Bay Area Tachyon meetup: August 25th, hosted by Yahoo! (Limited Space)

2014-08-19 Thread Haoyuan Li
Hi folks, We've posted the first Tachyon meetup, which will be on August 25th and is hosted by Yahoo! (Limited Space): http://www.meetup.com/Tachyon/events/200387252/ . Hope to see you there! Best, Haoyuan -- Haoyuan Li AMPLab, EECS, UC Berkeley http://www.cs.berkeley.edu/~haoyuan/

Re: First Bay Area Tachyon meetup: August 25th, hosted by Yahoo! (Limited Space)

2014-08-19 Thread Christopher Nguyen
Fantastic! Sent while mobile. Pls excuse typos etc. On Aug 19, 2014 4:09 PM, Haoyuan Li haoyuan...@gmail.com wrote: Hi folks, We've posted the first Tachyon meetup, which will be on August 25th and is hosted by Yahoo! (Limited Space): http://www.meetup.com/Tachyon/events/200387252/ . Hope

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-08-19 Thread Chris Fregly
this would be awesome. did a jira get created for this? I searched, but didn't find one. thanks! -chris On Tue, Jul 8, 2014 at 1:30 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: Thanks a lot Xiangrui. This will help. On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng men...@gmail.com

Decision tree: categorical variables

2014-08-19 Thread Sameer Tilak
Hi All, Is there any example of MLlib decision trees handling categorical variables? My dataset includes a few categorical variables (20 out of 100 features), so I was interested in knowing how I can use the current version of the decision tree implementation to handle this situation. I looked at the

High-Level Implementation Documentation

2014-08-19 Thread Kenny Ballou
Hey all, Other than reading the source (not a bad idea in and of itself; something I will get to soon), I was hoping to find some high-level implementation documentation. Can anyone point me to such document(s)? Thank you in advance. -Kenny -- :SIG:!0x1066BA71A5F56C58!:

Re: spark-submit with HA YARN

2014-08-19 Thread Sandy Ryza
Hi Matt, I checked in the YARN code and I don't see any references to yarn.resourcemanager.address. Have you made sure that your YARN client configuration on the node you're launching from contains the right configs? -Sandy On Mon, Aug 18, 2014 at 4:07 PM, Matt Narrell matt.narr...@gmail.com

Re: slower worker node in the cluster

2014-08-19 Thread Chris Fregly
Perhaps creating Fair Scheduler Pools might help? There's no way to pin certain nodes to a pool, but you can specify minShares (CPUs). Not sure if that would help, but it's worth looking into. On Tue, Jul 8, 2014 at 7:37 PM, haopu hw...@qilinsoft.com wrote: In a standalone cluster, is there way

Re: Decision tree: categorical variables

2014-08-19 Thread Xiangrui Meng
The categorical features must be encoded into indices starting from 0: 0, 1, ..., numCategories - 1. Then you can provide the categoricalFeatureInfo map to specify which columns contain categorical features and the number of categories in each. Joseph is updating the user guide. But if you want to
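
A sketch with the trainClassifier entry point (present in MLlib releases after the 1.0.x line discussed here; data and category counts are illustrative):

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.0, 2.3)),
      LabeledPoint(0.0, Vectors.dense(2.0, 1.1))))
    // Feature 0 is categorical with 3 categories, encoded as 0, 1, 2; feature 1 is continuous.
    val categoricalFeaturesInfo = Map(0 -> 3)
    // Arguments: input, number of classes, categorical feature info, impurity, maxDepth, maxBins.
    val model = DecisionTree.trainClassifier(data, 2, categoricalFeaturesInfo, "gini", 5, 32)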

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-08-19 Thread Xiangrui Meng
No. Please create one but it won't be able to catch the v1.1 train. -Xiangrui On Tue, Aug 19, 2014 at 4:22 PM, Chris Fregly ch...@fregly.com wrote: this would be awesome. did a jira get created for this? I searched, but didn't find one. thanks! -chris On Tue, Jul 8, 2014 at 1:30 PM,

Re: Multiple column families vs Multiple tables

2014-08-19 Thread chutium
ö_ö You should send this message to the HBase user list, not the Spark user list... but I can give you some personal advice about this: keep column families as few as possible! At least, using some prefix of the column qualifier could also be an idea, but read performance may be worse for your use case like

Re: Only master is really busy at KMeans training

2014-08-19 Thread Xiangrui Meng
There are only 5 worker nodes, so please try to reduce the number of partitions to the number of available CPU cores. 1000 partitions are too many, because the driver needs to collect the task result from each partition. -Xiangrui On Tue, Aug 19, 2014 at 1:41 PM, durin m...@simon-schaefer.net
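
A sketch of that advice (path and counts are placeholders): coalesce down to roughly the number of cores in the cluster before calling KMeans.train.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val vectors = sc.textFile("hdfs:///data/features")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .coalesce(40)                              // e.g. 5 workers x 8 cores; match your cluster
    val model = KMeans.train(vectors, 10, 20)    // k = 10, maxIterations = 20 as placeholders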

Re: Naive Bayes

2014-08-19 Thread Xiangrui Meng
The ratio should be okay. Could you try to pre-process the data and map -999.0 to 0 before calling NaiveBayes? Btw, I added a check to ensure nonnegative features values: https://github.com/apache/spark/pull/2038 -Xiangrui On Tue, Aug 19, 2014 at 1:39 PM, Phuoc Do phu...@vida.io wrote: Hi
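
A sketch of that pre-processing step (assumes training is the existing RDD[LabeledPoint]):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val cleaned = training.map { lp =>
      val fixed = lp.features.toArray.map(v => if (v == -999.0) 0.0 else v)
      LabeledPoint(lp.label, Vectors.dense(fixed))
    }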

Re: Multiple column families vs Multiple tables

2014-08-19 Thread Ted Yu
bq. does not do well with anything above two or three column families Current hbase releases, such as 0.98.x, would do better than the above. 5 column families should be accommodated. Cheers On Tue, Aug 19, 2014 at 3:06 PM, Wei Liu wei@stellarloyalty.com wrote: We are doing schema

Re: Multiple column families vs Multiple tables

2014-08-19 Thread Wei Liu
Chutium, thanks for your advices. I will check out your links. I sent the email to the wrong email address! Sorry for the spam. Wei On Tue, Aug 19, 2014 at 4:49 PM, chutium teng@gmail.com wrote: ö_ö you should send this message to hbase user list, not spark user list... but i can

Re: Is hive UDF are supported in HiveContext

2014-08-19 Thread chutium
There is no collect_list in Hive 0.12; try this after this ticket is done: https://issues.apache.org/jira/browse/SPARK-2706 I am also looking forward to this. -- View this message in context:

Re: type issue: found RDD[T] expected RDD[A]

2014-08-19 Thread Tobias Pfeiffer
Hi, On Tue, Aug 19, 2014 at 7:01 PM, Patrick McGloin mcgloin.patr...@gmail.com wrote: I think the type of the data contained in your RDD needs to be a known case class and not abstract for createSchemaRDD. This makes sense when you think it needs to know about the fields in the object to

Re: Naive Bayes

2014-08-19 Thread Phuoc Do
I replaced -999.0 with 0. Predictions still have same label. Maybe negative feature really messes it up. On Tue, Aug 19, 2014 at 4:51 PM, Xiangrui Meng men...@gmail.com wrote: The ratio should be okay. Could you try to pre-process the data and map -999.0 to 0 before calling NaiveBayes? Btw, I

Re: type issue: found RDD[T] expected RDD[A]

2014-08-19 Thread Evan Chan
That might not be enough. Reflection is used to determine what the fields are, thus your class might actually need to have members corresponding to the fields in the table. I heard that a more generic method of inputting stuff is coming. On Tue, Aug 19, 2014 at 6:43 PM, Tobias Pfeiffer

What about implementing various hypothesis test for Logistic Regression in MLlib

2014-08-19 Thread guxiaobo1982
Hi, From the documentation I think only the model fitting part is implemented; what about the various hypothesis tests and performance indexes used to evaluate the model fit? Regards, Xiaobo Gu

Re: type issue: found RDD[T] expected RDD[A]

2014-08-19 Thread Amit Kumar
Hi Evan, Patrick and Tobias, So, it worked for what I needed it to do. I followed Yana's suggestion of using a parameterized type of [T <: Product : ClassTag : TypeTag]. More concretely, I was trying to make the query process a bit more fluent - some pseudocode but with correct types val

Re: OutOfMemory Error

2014-08-19 Thread Ghousia
Hi, Any further info on this? Do you think it would be useful if we had an in-memory buffer implemented that stores the content of the new RDD? In case the buffer reaches a configured threshold, the contents of the buffer are spilled to the local disk. This saves us from the OutOfMemory error.

JVM heap and native allocation questions

2014-08-19 Thread kmatzen
I'm trying to use Spark to process some data using some native functions I've integrated using JNI, and I pass around a lot of memory I've allocated inside these functions. I'm not very familiar with the JVM, so I have a couple of questions. (1) Performance seemed terrible until I LD_PRELOAD'ed