Re: Better option to use Querying in Spark

2014-05-05 Thread Mayur Rustagi
All three have different use cases. If you are looking for more of a warehouse, you are better off with Shark. Spark SQL is a way to query regular data in SQL-like syntax, leveraging a columnar store. BlinkDB is an experiment, meant to integrate with Shark in the long term. Not meant for production useca
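To make the Spark SQL option concrete, here is a minimal sketch assuming the Spark 1.0 SQLContext API; the file path, Person schema, and query are illustrative assumptions, not taken from the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type for the example data.
case class Person(name: String, age: Int)

object SqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sql-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD conversion

    // Turn a plain RDD of case classes into a queryable table.
    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
    people.registerAsTable("people")

    // Query it with SQL syntax; the result is again an RDD of rows.
    val teens = sqlContext.sql(
      "SELECT name FROM people WHERE age BETWEEN 13 AND 19")
    teens.map(t => "Name: " + t(0)).collect().foreach(println)
  }
}
```

Shark would instead run SQL against a Hive warehouse, and BlinkDB layers approximate queries on top of that stack.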

Better option to use Querying in Spark

2014-05-05 Thread prabeesh k
Hi, I have seen three different ways to query data from Spark: 1. default SQL support ( https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/sql/examples/HiveFromSpark.scala ) 2. Shark 3. BlinkDB I would like to know which one is more efficient. Regard

Re: Apache spark on 27gb wikipedia data

2014-05-05 Thread Prashant Sharma
Try tuning options like memoryFraction and executorMemory, found here: http://spark.apache.org/docs/latest/configuration.html. Thanks Prashant Sharma On Mon, May 5, 2014 at 9:34 PM, Ajay Nair wrote: > Hi, > > Is there any way to overcome this error? I am running this from the > spark-shel
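As a sketch, the two knobs mentioned would be set on the SparkConf before creating the SparkContext; the values below are illustrative assumptions, and spark.storage.memoryFraction is the Spark 1.x name for the cache fraction:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("wikipedia-job")
  .set("spark.executor.memory", "6g")          // per-executor JVM heap
  .set("spark.storage.memoryFraction", "0.5")  // fraction of heap for cached RDDs
val sc = new SparkContext(conf)
```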

Re: mllib vector templates

2014-05-05 Thread David Hall
On Mon, May 5, 2014 at 3:40 PM, DB Tsai wrote: > David, > > Could we use Int, Long, Float as the data feature spaces, and Double for > optimizer? > Yes. Breeze doesn't allow operations on mixed types, so you'd need to convert the double vectors to Floats if you wanted, e.g. dot product with the
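The conversion David describes might look like this in Breeze (a sketch; the variable names are made up):

```scala
import breeze.linalg.DenseVector

val d: DenseVector[Double] = DenseVector(1.0, 2.0, 3.0)
// Breeze does not allow mixed-type operations, so convert Double -> Float first.
val f: DenseVector[Float] = d.mapValues(_.toFloat)
val g: DenseVector[Float] = DenseVector(0.5f, 0.5f, 0.5f)
// Both operands now share an element type, so the dot product is well-typed:
val dot: Float = f dot g // 1.0f*0.5f + 2.0f*0.5f + 3.0f*0.5f
```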

Re: mllib vector templates

2014-05-05 Thread Xiangrui Meng
I fixed the index type and value type to make things simple, especially when we need to provide Java and Python APIs. For raw features and feature transformations, we should allow generic types. -Xiangrui On Mon, May 5, 2014 at 3:40 PM, DB Tsai wrote: > David, > > Could we use Int, Long, Float as the da

Re: mllib vector templates

2014-05-05 Thread DB Tsai
David, Could we use Int, Long, Float as the data feature spaces, and Double for optimizer? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, May 5, 2014 at 3:06 PM, David Hall wrote:

Re: mllib vector templates

2014-05-05 Thread David Hall
I should mention it shouldn't be too hard to change, but it is a current limitation. On May 5, 2014 3:12 PM, "Debasish Das" wrote: > Is any one facing issues due to this ? If not then I guess doubles are > fine... > > For me it's not a big deal as there is enough memory available... > > > On Mon,

Re: mllib vector templates

2014-05-05 Thread Debasish Das
Is anyone facing issues due to this? If not, then I guess doubles are fine... For me it's not a big deal, as there is enough memory available... On Mon, May 5, 2014 at 3:06 PM, David Hall wrote: > Lbfgs and other optimizers would not work immediately, as they require > vector spaces over doubl

Re: mllib vector templates

2014-05-05 Thread David Hall
L-BFGS and other optimizers would not work immediately, as they require vector spaces over Double. Otherwise it should work. On May 5, 2014 3:03 PM, "DB Tsai" wrote: > Breeze could take any type (Int, Long, Double, and Float) in the matrix > template. > > > Sincerely, > > DB Tsai > ---

Re: mllib vector templates

2014-05-05 Thread DB Tsai
Breeze could take any type (Int, Long, Double, and Float) in the matrix template. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, May 5, 2014 at 2:56 PM, Debasish Das wrote: > Is th

Re: mllib vector templates

2014-05-05 Thread Debasish Das
Is this a Breeze issue, or can Breeze take templates on Float/Double? If Breeze can take templates, then it is a minor fix for Vectors.scala, right? Thanks. Deb On Mon, May 5, 2014 at 2:45 PM, DB Tsai wrote: > +1 Would be nice that we can use different type in Vector. > > > Sincerely, > > D

Re: mllib vector templates

2014-05-05 Thread DB Tsai
+1 It would be nice if we could use different types in Vector. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, May 5, 2014 at 2:41 PM, Debasish Das wrote: > Hi, > > Why mllib vector is

mllib vector templates

2014-05-05 Thread Debasish Das
Hi, why is the MLlib vector using Double as the default?

    /**
     * Represents a numeric vector, whose index type is Int and value type is Double.
     */
    trait Vector extends Serializable {
      /** Size of the vector. */
      def size: Int
      /** Converts the instance to a double array. *
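A templated variant along the lines the thread is asking about might look like this; this is purely a hypothetical sketch, not an MLlib proposal:

```scala
// Parameterize the value type instead of fixing Double; @specialized avoids
// boxing for the common primitive element types.
trait GenericVector[@specialized(Float, Double) V] extends Serializable {
  /** Size of the vector. */
  def size: Int
  /** Value at index i. */
  def apply(i: Int): V
  /** Converts the instance to an array of V. */
  def toArray: Array[V]
}
```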

Re: Mailing list

2014-05-05 Thread Matei Zaharia
> The script you're talking about, is it merge_spark_pr.py [1] ? Yup, that’s it. > >> Note by the way that using GitHub is not at all necessary for using Git. We >> happened to do our development on GitHub before moving to the ASF, and all >> our developers were used to its interface, so we s

Re: Apache spark on 27gb wikipedia data

2014-05-05 Thread Ajay Nair
Hi, is there any way to overcome this error? I am running this from the spark-shell; is that the cause for concern?

Re: bug using kryo as closure serializer

2014-05-05 Thread Soren Macbeth
I just took a peek at KryoSerializer, and it looks like you're already using all the Scala stuff from chill in there, so I would imagine that Scala things should serialize pretty well. It seems like the read-only ByteBuffer thing might be some other sort of downstream bug. On Sun, May 4, 2014 at 10:2

Re: Apache spark on 27gb wikipedia data

2014-05-05 Thread Prashant Sharma
I just thought maybe we could put up a warning whenever that error comes, saying the user can tune either the memoryFraction or executor memory options. This warning would get displayed when the TaskSetManager receives task failures due to OOM. Prashant Sharma On Mon, May 5, 2014 at 2:10 PM, Ajay Nair wrote: > Hi,

Apache spark on 27gb wikipedia data

2014-05-05 Thread Ajay Nair
Hi, I am using 1 master and 3 slave workers for processing 27 GB of Wikipedia data that is tab-separated, where every line contains Wikipedia page information. The tab-separated data has the title of the page and the page contents. I am using the regular expression to extract links as mentioned in the sit
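The original pattern is cut off above, so purely as an illustration, link extraction from wiki markup could be sketched with a plain Scala regex like this (the pattern is an assumption, not the one from the message):

```scala
// Matches [[Target]] and [[Target|display text]] and captures the link target.
val linkPattern = """\[\[([^\]|]+)(?:\|[^\]]*)?\]\]""".r

def extractLinks(line: String): Seq[String] =
  linkPattern.findAllMatchIn(line).map(_.group(1)).toSeq

// extractLinks("See [[Apache Spark|Spark]] and [[Scala (programming language)]].")
//   yields Seq("Apache Spark", "Scala (programming language)")
```

Each worker would apply extractLinks per line of the tab-separated page content, e.g. inside a flatMap over the input RDD.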