Re: SQLContext.applySchema strictness

2015-02-15 Thread Michael Armbrust
Applying a schema is a pretty low-level operation, and I would expect most users to use the type-safe interfaces. If you are unsure, you can always run: import org.apache.spark.sql.execution.debug._ schemaRDD.typeCheck() and it will tell you if you have made any mistakes. Michael On Sat, Feb
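For illustration, a minimal sketch of the typeCheck() call Michael mentions, assuming a Spark 1.2-era SQLContext in a spark-shell session (the schema and rows are made up):

    import org.apache.spark.sql._
    import org.apache.spark.sql.execution.debug._

    val sqlContext = new SQLContext(sc)
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))
    val rowRDD = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
    val schemaRDD = sqlContext.applySchema(rowRDD, schema)
    // Reports any rows whose runtime types do not match the declared schema
    schemaRDD.typeCheck()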

Re: Shuffle write increases in spark 1.2

2015-02-15 Thread Aaron Davidson
I think Xuefeng Wu's suggestion is likely correct. This difference is more likely explained by the compression library changing versions than by sort- vs. hash-based shuffle (which should not affect output size significantly). Others have reported that switching to lz4 fixed their issue. We should document
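For reference, a one-line sketch of switching codecs, assuming the standard spark.io.compression.codec property:

    import org.apache.spark.SparkConf

    // Force lz4 instead of the current default compression codec
    val conf = new SparkConf().set("spark.io.compression.codec", "lz4")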

Re: Specifying AMI when using Spark EC-2 scripts

2015-02-15 Thread gen tang
Hi, You can use -a or --ami <your-ami-id> to launch the cluster with a specific AMI. If I remember correctly, the default system is Amazon Linux. Hope it helps. Cheers Gen On Sun, Feb 15, 2015 at 6:20 AM, olegshirokikh o...@solver.com wrote: Hi there, Is there a way to specify the AWS AMI with
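For illustration, a hedged sketch of such a launch command (the key pair names and the ami-xxxxxxxx id are placeholders):

    ./spark-ec2 -k my-key -i my-key.pem --ami ami-xxxxxxxx launch my-cluster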

Percentile example

2015-02-15 Thread SiMaYunRui
hello, I am a newbie to Spark, trying to figure out how to compute a percentile over a big data set. I googled this topic but did not find any very useful code example or explanation. It seems that I can use the sortByKey transformation to get my data set in order, but I am not quite sure how I
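One possible approach, sketched under the assumption that the values fit in an RDD of doubles and that sc is a spark-shell SparkContext (the data and the 95th percentile are illustrative):

    val data = sc.parallelize(Seq(12.0, 5.0, 7.0, 42.0, 18.0))
    // Sort, then pair each rank with its value
    val indexed = data.sortBy(identity).zipWithIndex().map { case (v, i) => (i, v) }
    val n = indexed.count()
    val idx = math.max(math.ceil(0.95 * n).toLong - 1, 0L)
    // Fetch the value at the percentile rank
    val p95 = indexed.lookup(idx).head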

Re: Array in broadcast can't be serialized

2015-02-15 Thread Ted Yu
I was looking at https://github.com/twitter/chill It seems this would achieve what you want: chill-scala/src/main/scala/com/twitter/chill/WrappedArraySerializer.scala Cheers On Sat, Feb 14, 2015 at 6:36 PM, Tao Xiao xiaotao.cs@gmail.com wrote: I'm using Spark 1.1.0 and find that
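A hedged sketch of wiring that serializer in through a KryoRegistrator; the registrator class is made up, and the WrappedArraySerializer constructor should be checked against the chill version in use:

    import com.esotericsoftware.kryo.Kryo
    import com.twitter.chill.WrappedArraySerializer
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // Register the WrappedArray wrapper Scala uses when an Array is boxed
        // (assumption: chill's serializer handles reference-element arrays)
        kryo.register(classOf[scala.collection.mutable.WrappedArray.ofRef[_]],
          new WrappedArraySerializer[AnyRef])
      }
    }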

Inconsistent execution times for same application.

2015-02-15 Thread Kartheek.R
Hi, My Spark cluster contains a mix of machines: Pentium 4 boxes, dual-core and quad-core machines. I am trying to run a character-frequency-count application. The application contains several threads, each submitting a job (action) that counts the frequency of a single character. But my problem is that I get
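For context, a minimal sketch of the pattern described, assuming a shared spark-shell SparkContext and an illustrative input path:

    val text = sc.textFile("hdfs:///input/corpus.txt").cache()
    val threads = Seq('a', 'b', 'c').map { c =>
      new Thread(new Runnable {
        def run() {
          // Each thread submits its own count action for one character
          val n = text.flatMap(_.toCharArray).filter(_ == c).count()
          println(s"'$c' occurs $n times")
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())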

Re: Multidimensional K-Means

2015-02-15 Thread Sean Owen
Clustering operates on a large number of n-dimensional vectors. That seems to be what you are describing, and that is what the MLlib API accepts. What are you expecting that you don't find? Did you have a look at the KMeansModel that this method returns? It has a clusterCenters method that gives
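For illustration, a minimal sketch of multi-dimensional k-means with MLlib (the points and parameters are made up):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Each Vector is one n-dimensional point (here 2-D)
    val points = sc.parallelize(Seq(
      Vectors.dense(10.0, 20.0),
      Vectors.dense(20.0, 30.0),
      Vectors.dense(10.5, 19.0)))
    val model = KMeans.train(points, k = 2, maxIterations = 20)
    // The fitted centers are themselves n-dimensional vectors
    model.clusterCenters.foreach(println)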

Re: SparkStreaming Low Performance

2015-02-15 Thread Akhil Das
Thanks Enno, let me have a look at the streaming-parser version of Jackson. Thanks Best Regards On Sat, Feb 14, 2015 at 9:30 PM, Enno Shioji eshi...@gmail.com wrote: Huh, that would come to 6.5ms per JSON document. That does feel like a lot, but if your JSON file is big enough, I guess you could get that
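For reference, a hedged sketch of Jackson's streaming JsonParser, which reads tokens without materializing a full tree per record (the sample document and field names are illustrative):

    import com.fasterxml.jackson.core.{JsonFactory, JsonToken}

    val factory = new JsonFactory()
    val parser = factory.createParser("""{"user":"a","count":3}""")
    var count = -1L
    while (parser.nextToken() != null) {
      // The field name stays current while positioned on its value token
      if (parser.getCurrentName == "count" &&
          parser.getCurrentToken == JsonToken.VALUE_NUMBER_INT) {
        count = parser.getLongValue
      }
    }
    parser.close()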

Re: shark queries failed

2015-02-15 Thread Akhil Das
I'd suggest updating your Spark to the latest version and trying Spark SQL instead of Shark. Thanks Best Regards On Sun, Feb 15, 2015 at 7:36 AM, Grandl Robert rgra...@yahoo.com.invalid wrote: Hi guys, I deployed BlinkDB (built atop Shark) and am running Spark 0.9. I tried to run several TPCDS

Re: Multidimensional K-Means

2015-02-15 Thread Attila Tóth
Hi Sean, Thanks for the quick answer. I had not realized that I can make an RDD[Vector] with, e.g., val dataSet = sparkContext.makeRDD(List(Vectors.dense(10.0,20.0), Vectors.dense(20.0,30.0))). Using this, KMeans.train works as it should. So my bad. Thanks again! Attila 2015-02-15 17:29

spark-local dir running out of space during long ALS run

2015-02-15 Thread Antony Mayi
Hi, I am running a bigger ALS job on Spark 1.2.0 on YARN (CDH 5.3.0) - the ALS run uses about 3 billion ratings and I am doing several trainImplicit() runs in a loop within one Spark session. I have a four-node cluster with 3TB of disk space on each. Before starting the job, less than 8% of the disk
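For context, a hedged sketch of the kind of loop described (the ratings source and parameters are illustrative); each run leaves shuffle files behind in the local dirs:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val ratings = sc.textFile("hdfs:///ratings").map { line =>
      val Array(u, p, r) = line.split(',')
      Rating(u.toInt, p.toInt, r.toDouble)
    }
    for (rank <- Seq(10, 20, 50)) {
      // Implicit-feedback ALS; intermediate shuffle output accumulates across runs
      val model = ALS.trainImplicit(ratings, rank, 10, 0.01, 40.0)
      // ... evaluate the model here ...
    }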

Dynamic partition pattern support

2015-02-15 Thread Jianshi Huang
Hi, HCatalog allows you to specify the pattern of paths for partitions, which will be used by dynamic partition loading. https://cwiki.apache.org/confluence/display/Hive/HCatalog+DynamicPartitions#HCatalogDynamicPartitions-ExternalTables Can we have a similar feature in Spark SQL? The JIRA is here:

Multidimensional K-Means

2015-02-15 Thread Attila Tóth
Dear Spark User List, I'm fairly new to Spark and am trying to use it for multi-dimensional clustering (using the k-means clustering from MLlib). However, based on the examples, the clustering seems to work only in a single dimension (KMeans.train() accepts an RDD[Vector], which is a vector of doubles -

Re: New ColumnType For Decimal Caching

2015-02-15 Thread Michael Armbrust
That sounds right to me. Cheng could elaborate if you are missing something. On Fri, Feb 13, 2015 at 11:36 AM, Manoj Samel manojsamelt...@gmail.com wrote: Thanks Michael for the pointer. Sorry for the delayed reply. Taking a quick inventory of the scope of the change - is the column type for Decimal

shark queries failed

2015-02-15 Thread Grandl Robert
Hi guys, I deployed BlinkDB (built atop Shark) and am running Spark 0.9. I tried to run several TPCDS Shark queries taken from https://github.com/cloudera/impala-tpcds-kit/tree/master/queries-sql92-modified/queries/shark. However, the following exceptions are encountered. Do you have any idea why

Re: Loading JSON dataset with Spark Mllib

2015-02-15 Thread gen tang
Hi, In fact, you can use sqlCtx.jsonFile(), which loads a text file storing one JSON object per line as a SchemaRDD. Or you can use sc.textFile() to load the text file into an RDD and then use sqlCtx.jsonRDD(), which loads an RDD storing one JSON object per string as a SchemaRDD. Hope it helps
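For illustration, a minimal sketch of both approaches (the path is a placeholder):

    import org.apache.spark.sql.SQLContext

    val sqlCtx = new SQLContext(sc)
    // Option 1: load a file with one JSON object per line directly
    val people = sqlCtx.jsonFile("hdfs:///data/people.json")
    // Option 2: load as plain text first, then infer the schema
    val raw = sc.textFile("hdfs:///data/people.json")
    val people2 = sqlCtx.jsonRDD(raw)
    people.printSchema()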

Re: shark queries failed

2015-02-15 Thread Grandl Robert
Thanks for the reply, Akhil. I cannot update the Spark version and run Spark SQL due to some old dependencies and a specific project I want to run. I was wondering if you have any clue why that exception might be triggered, or if you have seen it before. Thanks, Robert On Sunday, February 15,

Re: Unable to query hive tables from spark

2015-02-15 Thread Todd Nist
What does your hive-site.xml look like? Do you actually have a directory at the location shown in the error? i.e. does /user/hive/warehouse/src exist? You should be able to override this by specifying the following: --hiveconf hive.metastore.warehouse.dir=/location/where/your/warehouse/exists
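For example, a hedged sketch using the spark-sql CLI, which passes Hive-style options through (the path is a placeholder):

    ./bin/spark-sql --hiveconf hive.metastore.warehouse.dir=/path/to/your/warehouse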

Re: New ColumnType For Decimal Caching

2015-02-15 Thread Cheng Lian
Hi Manoj, Yes, you've already hit the point. I think the timestamp type support in the in-memory columnar storage can be a good reference for you. Also, you may want to enable compression support for the decimal type by adding the DECIMAL column type to RunLengthEncoding.supports and

Re: [GraphX] Excessive value recalculations during aggregateMessages cycles

2015-02-15 Thread Takeshi Yamamuro
Hi, I tried some quick and simple tests; ISTM the vertices below were correctly cached. Could you give me the differences between my code and yours? import org.apache.spark.graphx._ import org.apache.spark.graphx.lib._ object Prog { def processInt(d: Int) = d * 2 } val g =
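For comparison, a hedged sketch of explicitly caching the graph between aggregateMessages passes (the toy graph and message logic are made up):

    import org.apache.spark.graphx._

    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    var g = Graph.fromEdges(edges, defaultValue = 0).cache()
    for (_ <- 1 to 5) {
      // Sum incoming messages, then fold them back into the vertex values
      val msgs = g.aggregateMessages[Int](ctx => ctx.sendToDst(ctx.srcAttr + 1), _ + _)
      g = g.outerJoinVertices(msgs) { (_, old, msg) => msg.getOrElse(old) }.cache()
    }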

Re: Shuffle write increases in spark 1.2

2015-02-15 Thread Ami Khandeshi
I have seen the same behavior! I would love to hear an update on this... Thanks, Ami On Thu, Feb 5, 2015 at 8:26 AM, Anubhav Srivastav anubhav.srivas...@gmail.com wrote: Hi Kevin, We seem to be facing the same problem as well. Were you able to find anything after that? The ticket does not

Array in broadcast can't be serialized

2015-02-15 Thread Tao Xiao
I'm using Spark 1.1.0 and find that ImmutableBytesWritable can be serialized by Kryo but Array[ImmutableBytesWritable] can't be serialized even when I registered both of them in Kryo. The code is as follows: val conf = new SparkConf().setAppName("Hello Spark")

Specifying AMI when using Spark EC-2 scripts

2015-02-15 Thread olegshirokikh
Hi there, Is there a way to specify an AWS AMI with a particular OS (say Ubuntu) when launching Spark on the Amazon cloud with the provided scripts? What is the default AMI/operating system launched by the EC-2 script? Thanks -- View this message in context:

RE: Extract hour from Timestamp in Spark SQL

2015-02-15 Thread Cheng, Hao
Are you using the SQLContext? I think the HiveContext is recommended. Cheng Hao From: Wush Wu [mailto:w...@bridgewell.com] Sent: Thursday, February 12, 2015 2:24 PM To: u...@spark.incubator.apache.org Subject: Extract hour from Timestamp in Spark SQL Dear all, I am new to Spark SQL and have no
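For illustration, a minimal sketch with HiveContext, whose HiveQL dialect includes the built-in hour() UDF (the table and column names are made up):

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)
    // hour() extracts the hour-of-day from a timestamp column
    hiveCtx.sql("SELECT hour(ts) FROM events").collect().foreach(println)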

monit with spark

2015-02-15 Thread Mike Sam
We want to monitor the Spark master and slaves using monit, but we want to use the sbin scripts to do so. The scripts spawn the Spark master and slave processes and then exit, so monit would not know the PID of the started process to watch. Is this correct? Should we watch the ports?
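One possibility is matching on the process pattern rather than a PID file; a hedged monit sketch, with paths, port and pattern to be adapted to your install:

    # Match the JVM by its main class instead of relying on a pid file
    check process spark-master matching "org.apache.spark.deploy.master.Master"
      start program = "/opt/spark/sbin/start-master.sh"
      stop program  = "/opt/spark/sbin/stop-master.sh"
      if failed port 7077 then restart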

Loading JSON dataset with Spark Mllib

2015-02-15 Thread pankaj channe
Hi, I am new to Spark and planning on writing a machine learning application with Spark MLlib. My dataset is in JSON format. Is it possible to load the data into Spark without using any external JSON libraries? I have explored the option of Spark SQL, but I believe that is only for interactive use or

Re: spark-local dir running out of space during long ALS run

2015-02-15 Thread Antony Mayi
spark.cleaner.ttl? On Sunday, 15 February 2015, 18:23, Antony Mayi antonym...@yahoo.com wrote: Hi, I am running a bigger ALS job on Spark 1.2.0 on YARN (CDH 5.3.0) - the ALS run uses about 3 billion ratings and I am doing several trainImplicit() runs in a loop within one Spark session.
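For reference, a hedged sketch of setting that property (the TTL is in seconds and the value is illustrative):

    import org.apache.spark.SparkConf

    // Periodically clean metadata and shuffle data older than one hour
    val conf = new SparkConf().set("spark.cleaner.ttl", "3600")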

Re: HiveContext created SchemaRDD's saveAsTable is not working on 1.2.0

2015-02-15 Thread matroyd
It works now using 1.2.1. Thanks for all the help. Spark rocks!! - Thanks, Roy -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Re-HiveContext-created-SchemaRDD-s-saveAsTable-is-not-working-on-1-2-0-tp21442p21664.html Sent from the Apache Spark User

Re: Writing to HDFS from spark Streaming

2015-02-15 Thread Bahubali Jain
I used the latest assembly jar and the code below, as suggested by Akhil, to fix this problem: temp.saveAsHadoopFiles("DailyCSV", ".txt", String.class, String.class, (Class) TextOutputFormat.class); Thanks All for the help! On Wed, Feb 11, 2015 at 1:38 PM, Sean Owen so...@cloudera.com wrote: That

WARN from Similarity Calculation

2015-02-15 Thread Debasish Das
Hi, I am sometimes getting a WARN while running the similarity calculation: 15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, abc.com, 48419, 0) with no recent heart beats: 66435ms exceeds 45000ms Do I need to increase the default 45s to larger values for cases
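For reference, a hedged sketch of raising that timeout, assuming the Spark 1.2-era spark.storage.blockManagerSlaveTimeoutMs property (verify the property name for your version):

    import org.apache.spark.SparkConf

    // Raise the no-heartbeat threshold from the 45000ms default to 5 minutes
    val conf = new SparkConf().set("spark.storage.blockManagerSlaveTimeoutMs", "300000")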