getting the cluster elements from kmeans run

2015-02-11 Thread Harini Srinivasan
Hi, Is there a way to get the elements of each cluster after running kmeans clustering? I am using the Java version. thanks

Re: Hive/Hbase for low latency

2015-02-11 Thread VISHNU SUBRAMANIAN
Hi Siddharth, It depends on what you are trying to solve, but the connectivity between Cassandra and Spark is good. Thanks, Vishnu On Wed, Feb 11, 2015 at 7:47 PM, Siddharth Ubale siddharth.ub...@syncoms.com wrote: Hi, I am new

Re: spark sql registerFunction with 1.2.1

2015-02-11 Thread Yin Huai
Regarding backticks: Right. You need backticks to quote the column name timestamp because timestamp is a reserved keyword in our parser. On Tue, Feb 10, 2015 at 3:02 PM, Mohnish Kodnani mohnish.kodn...@gmail.com wrote: actually i tried in spark shell , got same error and then for some reason i
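For reference, quoting a reserved word as a column name in Spark SQL looks roughly like this (the table and column names below are illustrative, not from the original thread):

    // assumes a table "logs" was registered with a column literally named "timestamp"
    val hourly = sqlContext.sql(
      "SELECT `timestamp`, count(*) FROM logs GROUP BY `timestamp`")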

Re: How can I read this avro file using spark scala?

2015-02-11 Thread captainfranz
I am confused as to whether avro support was merged into Spark 1.2 or whether it is still an independent library. I see some people writing sqlContext.avroFile similarly to jsonFile, but this does not work for me, nor do I see this in the Scala docs. -- View this message in context:

Re: Question related to Spark SQL

2015-02-11 Thread VISHNU SUBRAMANIAN
I didn't mean that. When you try the above approach, only one client will have access to the cached data. But when you expose your data through a thrift server the case is quite different. In the case of the thrift server, all the requests go to the thrift server and spark will be able to take the

Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-11 Thread Andrew Or
Hi Jianshi, For YARN, there may be an issue with how a recent patch changes the accessibility of the shuffle files by the external shuffle service: https://issues.apache.org/jira/browse/SPARK-5655. It is likely that you will hit this with 1.2.1, actually. For this reason I would have to

SPARK_LOCAL_DIRS Issue

2015-02-11 Thread TJ Klein
Hi, Using Spark 1.2 I ran into issues setting SPARK_LOCAL_DIRS to a different path than the local directory. On our cluster we have a folder for temporary files (in a central file system), which is called /scratch. When setting SPARK_LOCAL_DIRS=/scratch/node name I get: An error occurred while

Re: Re: How can I read this avro file using spark scala?

2015-02-11 Thread VISHNU SUBRAMANIAN
Check this link. https://github.com/databricks/spark-avro Home page for Spark-avro project. Thanks, Vishnu On Wed, Feb 11, 2015 at 10:19 PM, Todd bit1...@163.com wrote: Databricks provides a sample code on its website...but i can't find it for now. At 2015-02-12 00:43:07, captainfranz
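A minimal sketch of using the spark-avro package with Spark 1.2 (assuming the spark-avro jar is on the classpath and its 1.x API; the file path is illustrative):

    import com.databricks.spark.avro._

    // the import adds an avroFile method to SQLContext via an implicit conversion
    val episodes = sqlContext.avroFile("episodes.avro")
    episodes.registerTempTable("episodes")
    sqlContext.sql("SELECT * FROM episodes LIMIT 10").collect()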

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Charles Feduke
A central location, such as NFS? If they are temporary for the purpose of further job processing you'll want to keep them local to the node in the cluster, i.e., in /tmp. If they are centralized you won't be able to take advantage of data locality and the central file store will become a

Re: Hive/Hbase for low latency

2015-02-11 Thread Ravi Kiran
Hi Siddharth, With v4.3 of Phoenix, you can use the PhoenixInputFormat and OutputFormat classes to pull/push to Phoenix from Spark. HTH Thanks Ravi On Wed, Feb 11, 2015 at 6:59 AM, Ted Yu yuzhih...@gmail.com wrote: Connectivity to HBase is also available. You can take a look at:

Re:Re: How can I read this avro file using spark scala?

2015-02-11 Thread Todd
Databricks provides sample code on its website... but I can't find it right now. At 2015-02-12 00:43:07, captainfranz captainfr...@gmail.com wrote: I am confused as to whether avro support was merged into Spark 1.2 or whether it is still an independent library. I see some people writing

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Tassilo Klein
Thanks for the info. The file system in use is a Lustre file system. Best, Tassilo On Wed, Feb 11, 2015 at 12:15 PM, Charles Feduke charles.fed...@gmail.com wrote: A central location, such as NFS? If they are temporary for the purpose of further job processing you'll want to keep them

Re: exception with json4s render

2015-02-11 Thread Mohnish Kodnani
Same here... I am a newbie to all this as well. But this is just what I found, and I lack the expertise to figure out why things don't work with json4s 3.2.11. Maybe someone in the group with more expertise can take a crack at it, but this is what unblocked me from moving forward. On Wed, Feb 11,

Re: spark sql registerFunction with 1.2.1

2015-02-11 Thread Mohnish Kodnani
That explains a lot... Is there a list of reserved keywords? On Wed, Feb 11, 2015 at 7:56 AM, Yin Huai yh...@databricks.com wrote: Regarding backticks: Right. You need backticks to quote the column name timestamp because timestamp is a reserved keyword in our parser. On Tue, Feb 10, 2015

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Charles Feduke
Take a look at this: http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre Particularly: http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf (linked from that article) to get a better idea of what your options are. If it's possible to avoid writing to [any] disk I'd recommend that

Re: OutOfMemoryError with random forest and small training dataset

2015-02-11 Thread poiuytrez
cat ../hadoop/spark-install/conf/spark-env.sh export SCALA_HOME=/home/hadoop/scala-install export SPARK_WORKER_MEMORY=83971m export SPARK_MASTER_IP=spark-m export SPARK_DAEMON_MEMORY=15744m export SPARK_WORKER_DIR=/hadoop/spark/work export SPARK_LOCAL_DIRS=/hadoop/spark/tmp export

Re: getting the cluster elements from kmeans run

2015-02-11 Thread VISHNU SUBRAMANIAN
You can use model.predict(point); that will give you the cluster for each point, which you can then pair with the point: rdd.map(x => (x, model.predict(x))) Thanks, Vishnu On Wed, Feb 11, 2015 at 11:06 PM, Harini Srinivasan har...@us.ibm.com wrote: Hi, Is there a way to get the elements of each cluster after
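Put together, a minimal sketch of collecting the members of each cluster (parsedData, numClusters and numIterations are assumed names, following the question):

    import org.apache.spark.mllib.clustering.KMeans

    // parsedData: RDD[Vector] holding the training points
    val model = KMeans.train(parsedData, numClusters, numIterations)
    // pair every point with the id of the cluster it falls in, then group by cluster id
    val clusterMembers = parsedData.map(p => (model.predict(p), p)).groupByKey()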

Re: getting the cluster elements from kmeans run

2015-02-11 Thread Suneel Marthi
KMeansModel only returns the cluster centroids. To get the # of elements in each cluster, try calling kmeans.predict() on each of the points in the data used to build the model. See

Re: Can spark job server be used to visualize streaming data?

2015-02-11 Thread Su She
Thank you Felix and Kelvin. I think I'll definitely be using the k-means tools in MLlib. It seems the best way to stream data is by storing it in HBase and then using an API in my visualization to extract data? Does anyone have any thoughts on this? Thanks! On Tue, Feb 10, 2015 at 11:45 PM, Felix C

Re: iteratively modifying an RDD

2015-02-11 Thread Davies Liu
On Wed, Feb 11, 2015 at 10:47 AM, rok rokros...@gmail.com wrote: I was having trouble with memory exceptions when broadcasting a large lookup table, so I've resorted to processing it iteratively -- but how can I modify an RDD iteratively? I'm trying something like : rdd =

exception with json4s render

2015-02-11 Thread Jonathan Haddad
I'm trying to use the json4s library in a spark job to push data back into kafka. Everything was working fine when I was hard coding a string, but now that I'm trying to render a string from a simple map it's failing. The code works in the sbt console. Working console code:
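The working snippet is truncated in the archive; as a point of reference, rendering a simple map with json4s usually looks something like this (the keys and values are illustrative):

    import org.json4s.JsonDSL._
    import org.json4s.jackson.JsonMethods._

    // build a small JSON object from plain values and render it to a string
    val json = ("topic" -> "events") ~ ("count" -> 1)
    val payload = compact(render(json))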

Re: what is behind matrix multiplications?

2015-02-11 Thread Reza Zadeh
Yes, the local matrix is broadcast to each worker. Here is the code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L407 In 1.3 we will have Block matrix multiplication too, which will allow distributed matrix
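For reference, multiplying a distributed RowMatrix by a local matrix (the operation that broadcasts the local operand) looks roughly like this; mat is an assumed RowMatrix with two columns:

    import org.apache.spark.mllib.linalg.Matrices

    // a small 2x2 local matrix held on the driver (column-major values)
    val local = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
    // the local matrix is shipped to each worker that holds rows of mat
    val product = mat.multiply(local)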

Re: Can't access remote Hive table from spark

2015-02-11 Thread Zhan Zhang
You need to have the right hdfs account, e.g., hdfs, to create the directory and assign permissions. Thanks. Zhan Zhang On Feb 11, 2015, at 4:34 AM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi Zhan, My Single Node Cluster of Hadoop is installed by Ambari 1.7.0, I tried to

Re: exception with json4s render

2015-02-11 Thread Mohnish Kodnani
I was getting a similar error after I upgraded to Spark 1.2.1 from 1.1.1. Are you by any chance using json4s 3.2.11? I downgraded to 3.2.10 and that seemed to work. But I didn't spend much more time debugging the issue than that. On Wed, Feb 11, 2015 at 11:13 AM, Jonathan Haddad
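If it helps, the downgrade described above would look something like this in an sbt build (a sketch; swap in whichever json4s module your project actually depends on):

    // build.sbt fragment: pin json4s to 3.2.10 to line up with Spark 1.2.x
    libraryDependencies += "org.json4s" %% "json4s-jackson" % "3.2.10"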

Re:Re: A signature in Logging.class refers to type Logger in package org.slf4j which is not available.

2015-02-11 Thread Todd
Thanks for the reply. I have the following Maven dependencies, which look correct to me: Maven: org.slf4j:slf4j-log4j12:1.7.5 Maven: org.slf4j:jcl-over-slf4j:1.7.5 Maven: org.slf4j:jul-to-slf4j:1.7.5 Maven: org.slf4j:slf4j-api:1.7.5 Maven: log4j:log4j:1.2.17 At 2015-02-11 23:27:54, Ted Yu

A signature in Logging.class refers to type Logger in package org.slf4j which is not available.

2015-02-11 Thread Todd
After compiling the Spark 1.2.0 codebase in IntelliJ IDEA and running the LocalPi example, I got the following slf4j-related issue. Does anyone know how to fix this? Thanks Error:scalac: bad symbolic reference. A signature in Logging.class refers to type Logger in package org.slf4j which is not

Re: Why does spark write huge file into temporary local disk even without on-disk persist or checkpoint?

2015-02-11 Thread Peng Cheng
You are right. I've checked the overall stage metrics and it looks like the largest shuffle write is over 9G. The partition completed successfully but its spilled file can't be removed until all the others are finished. It's very likely caused by a stupid mistake in my design. A lookup table grows

Hive/Hbase for low latency

2015-02-11 Thread Siddharth Ubale
Hi, I am new to Spark. We have recently moved from Apache Storm to Apache Spark to build our OLAP tool. Earlier we were using HBase with Phoenix. We need to re-think what to use in the case of Spark. Should we go ahead with HBase, Hive, or Cassandra for query processing with Spark SQL? Please

Re: A signature in Logging.class refers to type Logger in package org.slf4j which is not available.

2015-02-11 Thread Ted Yu
Spark depends on slf4j 1.7.5 Please check your classpath and make sure slf4j is included. Cheers On Wed, Feb 11, 2015 at 6:20 AM, Todd bit1...@163.com wrote: After compiling the Spark 1.2.0 codebase in Intellj Idea, and run the LocalPi example,I got the following slf4j related issue. Does

RE: SparkSQL + Tableau Connector

2015-02-11 Thread Andrew Lee
Sorry folks, it is executing Spark jobs instead of Hive jobs. I mis-read the logs since there were other activities going on on the cluster. From: alee...@hotmail.com To: ar...@sigmoidanalytics.com; tsind...@gmail.com CC: user@spark.apache.org Subject: RE: SparkSQL + Tableau Connector Date: Wed,

RE: Is the Thrift server right for me?

2015-02-11 Thread Judy Nash
It should relay the queries to Spark (i.e., you shouldn't see any MR job on Hadoop; you should see activity on the Spark app in the headnode UI). Check your hive-site.xml. Are you pointing to the HiveServer2 port instead of the Spark thrift port? Their default ports are both 10000. From: Andrew

No executors allocated on yarn with latest master branch

2015-02-11 Thread Anders Arpteg
Hi, Compiled the latest master of Spark yesterday (2015-02-10) for Hadoop 2.2 and failed executing jobs in yarn-cluster mode for that build. Works successfully with spark 1.2 (and also master from 2015-01-16), so something has changed since then that prevents the job from receiving any executors

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
Aha great! Thanks for the clarification! On Feb 11, 2015 8:11 PM, Davies Liu dav...@databricks.com wrote: On Wed, Feb 11, 2015 at 10:47 AM, rok rokros...@gmail.com wrote: I was having trouble with memory exceptions when broadcasting a large lookup table, so I've resorted to processing it

Re: Bug in ElasticSearch and Spark SQL: Using SQL to query out data from JSON documents is totally wrong!

2015-02-11 Thread Aris
Thank you Costin. I wrote out to the user list, I got no replies there. I will take this exact message and put it on the Github bug tracking system. One quick clarification: I read the elasticsearch documentation thoroughly, and I saw the warning about structured data vs. unstructured data, but

Re: Mesos coarse mode not working (fine grained does)

2015-02-11 Thread Hans van den Bogert
Bumping a one-on-one conversation to the mailing list: On 10 Feb 2015, at 13:24, Hans van den Bogert hansbog...@gmail.com wrote: It's self-built; I can't do otherwise as I can't install packages on the cluster here. The problem seems to be with libtool. When compiling Mesos on a host with apr-devel and

RE: Is the Thrift server right for me?

2015-02-11 Thread Andrew Lee
I have ThriftServer2 up and running; however, I notice that it relays the query to HiveServer2 when I pass the hive-site.xml to it. I'm not sure if this is the expected behavior, but based on what I have up and running, the ThriftServer2 invokes HiveServer2, which results in MapReduce or Tez

Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2015-02-11 Thread Lan
Hi Alexey and Daniel, I'm using Spark 1.2.0 and still having the same error, as described below. Do you have any news on this? Really appreciate your responses!!! a Spark cluster of 1 master VM SparkV1 and 1 worker VM SparkV4 (the error is the same if I have 2 workers). They are connected

Re: How to do broadcast join in SparkSQL

2015-02-11 Thread Dima Zhiyanov
Hello Has Spark implemented computing statistics for Parquet files? Or is there any other way I can enable broadcast joins between parquet file RDDs in Spark Sql? Thanks Dima -- View this message in context:

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Mark Hamstra
No, only each group should need to fit. On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet cjno...@gmail.com wrote: Doesn't iter still need to fit entirely into memory? On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra m...@clearstorydata.com wrote: rdd.mapPartitions { iter => val grouped =

Re: How to do broadcast join in SparkSQL

2015-02-11 Thread Ted Yu
See earlier thread: http://search-hadoop.com/m/JW1q5BZhf92 On Wed, Feb 11, 2015 at 3:04 PM, Dima Zhiyanov dimazhiya...@gmail.com wrote: Hello Has Spark implemented computing statistics for Parquet files? Or is there any other way I can enable broadcast joins between parquet file RDDs in

Re: No executors allocated on yarn with latest master branch

2015-02-11 Thread Sandy Ryza
Hi Anders, I just tried this out and was able to successfully acquire executors. Any strange log messages or additional color you can provide on your setup? Does yarn-client mode work? -Sandy On Wed, Feb 11, 2015 at 1:28 PM, Anders Arpteg arp...@spotify.com wrote: Hi, Compiled the latest

A spark join and groupbykey that is making my containers on EC2 go over their memory limits

2015-02-11 Thread Sina Samangooei
Hello, I have many questions about joins, but arguably just one: specifically about memory and containers that are overstepping their limits, as per errors dotted around all over the place, but something like:

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Mark Hamstra
rdd.mapPartitions { iter => val grouped = iter.grouped(batchSize); for (group <- grouped) { ... } } On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet cjno...@gmail.com wrote: I think the word partition here is a tad different than the term partition that we use in Spark. Basically, I want
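Expanded slightly, the pattern is (the batch size and the per-group function are illustrative):

    // process each partition in groups of up to batchSize elements
    val batchSize = 30
    val results = rdd.mapPartitions { iter =>
      iter.grouped(batchSize).map { group =>
        processGroup(group)   // hypothetical function that works on up to 30 records at a time
      }
    }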

Re: Spark ML pipeline

2015-02-11 Thread Reynold Xin
Yes. Next release (Spark 1.3) is coming out end of Feb / early Mar. On Wed, Feb 11, 2015 at 7:22 AM, Jianguo Li flyingfromch...@gmail.com wrote: Hi, I really like the pipeline in the spark.ml in Spark1.2 release. Will there be more machine learning algorithms implemented for the pipeline

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
the runtime for each consecutive iteration is still roughly twice as long as for the previous one -- is there a way to reduce whatever overhead is accumulating? On Feb 11, 2015, at 8:11 PM, Davies Liu dav...@databricks.com wrote: On Wed, Feb 11, 2015 at 10:47 AM, rok rokros...@gmail.com

Re: Similar code in Java

2015-02-11 Thread Eduardo Costa Alfaia
Thanks Ted. On Feb 10, 2015, at 20:06, Ted Yu yuzhih...@gmail.com wrote: Please take a look at: examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaDirectKafkaWordCount.java which was checked in yesterday. On Sat, Feb 7, 2015 at 10:53 AM, Eduardo Costa Alfaia

Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet
I think the word partition here is a tad different than the term partition that we use in Spark. Basically, I want something similar to Guava's Iterables.partition [1], that is, if I have an RDD[People] and I want to run an algorithm that can be optimized by working on 30 people at a time, I'd

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet
Doesn't iter still need to fit entirely into memory? On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra m...@clearstorydata.com wrote: rdd.mapPartitions { iter => val grouped = iter.grouped(batchSize); for (group <- grouped) { ... } } On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet

SPARK_LOCAL_DIRS and SPARK_WORKER_DIR

2015-02-11 Thread gtinside
Hi , What is the difference between SPARK_LOCAL_DIRS and SPARK_WORKER_DIR ? Also does spark clean these up after the execution ? Regards, Gaurav -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-LOCAL-DIRS-and-SPARK-WORKER-DIR-tp21612.html Sent from

Re: iteratively modifying an RDD

2015-02-11 Thread Davies Liu
On Wed, Feb 11, 2015 at 2:43 PM, Rok Roskar rokros...@gmail.com wrote: the runtime for each consecutive iteration is still roughly twice as long as for the previous one -- is there a way to reduce whatever overhead is accumulating? Sorry, I didn't fully understand your question, which two are

Containers on EC2 instances go over their memory limits

2015-02-11 Thread Sina Samangooei
Hello, I have many questions about joins, but arguably just one: specifically about memory and containers that are overstepping their limits, as per errors dotted around all over the place, but something like:

Spark based ETL pipelines

2015-02-11 Thread Jagat Singh
Hi, I want to work on a use case something like the below. I just want to know if something similar has already been done that can be reused. The idea is to use Spark for an ETL / Data Science / Streaming pipeline. So when data comes inside the cluster front door we will do the following steps: 1) Upload

Re: SparkSQL + Tableau Connector

2015-02-11 Thread Silvio Fiorito
Hey Todd, I don’t have an app to test against the thrift server; are you able to define custom SQL without using Tableau’s schema query? I guess it’s not possible to just use SparkSQL temp tables; you may have to use permanent Hive tables that are actually in the metastore so Tableau can

RE: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Yang, Yuhao
Check spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala It can be used through sliding(windowSize: Int) in spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala Yuhao From: Mark Hamstra [mailto:m...@clearstorydata.com] Sent: Thursday, February 12, 2015
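For reference, the sliding helper mentioned above is used roughly like this; note it yields overlapping windows of consecutive elements rather than disjoint chunks:

    import org.apache.spark.mllib.rdd.RDDFunctions._

    // windows of 3 consecutive elements, e.g. (a,b,c), (b,c,d), ...
    val windows = rdd.sliding(3)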

feeding DataFrames into predictive algorithms

2015-02-11 Thread Sandy Ryza
Hey All, I've been playing around with the new DataFrame and ML pipelines APIs and am having trouble accomplishing what seems like it should be a fairly basic task. I have a DataFrame where each column is a Double. I'd like to turn this into a DataFrame with a features column and a label column

Re: SparkSQL + Tableau Connector

2015-02-11 Thread Todd Nist
First sorry for the long post. So back to tableau and Spark SQL, I'm still missing something. TL;DR To get the Spark SQL Temp table associated with the metastore are there additional steps required beyond doing the below? Initial SQL on connection: create temporary table test using

Re: Strongly Typed SQL in Spark

2015-02-11 Thread jay vyas
Ah, never mind, I just saw http://spark.apache.org/docs/1.2.0/sql-programming-guide.html (language integrated queries), which looks quite similar to what I was thinking about. I'll give that a whirl... On Wed, Feb 11, 2015 at 7:40 PM, jay vyas jayunit100.apa...@gmail.com wrote: Hi spark. is

Strongly Typed SQL in Spark

2015-02-11 Thread jay vyas
Hi spark, is there anything in the works for a typesafe HQL-like API for building spark queries from case classes? i.e., where, given a domain object Product with a cost associated with it, we can do something like: query.select(Product).filter({ _.cost > 50.00f

how to avoid Spark and Hive log from Application log

2015-02-11 Thread sachin Singh
Hi, can somebody please help with how to keep Spark and Hive logs out of the application log? Both Spark and Hive are using a log4j property file. I have configured the log4j.properties file for my application as below, but it is printing the Spark and Hive console logging as well. Please suggest; it's urgent for me,
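One programmatic way to do this (a sketch; the equivalent settings can also live in log4j.properties) is to raise the log level of the Spark and Hive loggers:

    import org.apache.log4j.{Level, Logger}

    // quiet the framework loggers while leaving the application's own logger untouched
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.apache.hadoop.hive").setLevel(Level.WARN)
    Logger.getLogger("hive").setLevel(Level.WARN)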

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-11 Thread fightf...@163.com
Hi, I really have found no adequate solution for this issue. Hoping for any available analysis rules or hints. Thanks, Sun. fightf...@163.com From: fightf...@163.com Date: 2015-02-09 11:56 To: user; dev Subject: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data

Re: Can spark job server be used to visualize streaming data?

2015-02-11 Thread Felix C
What kind of data do you have? Kafka is a popular source to use with spark streaming. But spark streaming also supports reading from a file; it's called a basic source https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers --- Original Message --- From:

obtain cluster assignment in K-means

2015-02-11 Thread Shi Yu
Hi there, I am new to spark. When training a model using K-means using the following code, how do I obtain the cluster assignment in the next step? val clusters = KMeans.train(parsedData, numClusters, numIterations) I searched around many examples but they mostly calculate the WSSSE. I am

Re: Can spark job server be used to visualize streaming data?

2015-02-11 Thread Su She
Hello Felix, I am already streaming in very simple data using Kafka (few messages / second, each record only has 3 columns...really simple, but looking to scale once I connect everything). I am processing it in Spark Streaming and am currently writing word counts to hdfs. So the part where I am

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
Yes, sorry I wasn't clear -- I still have to trigger the calculation of the RDD at the end of each iteration. Otherwise all of the lookup tables are shipped to the cluster at the same time, resulting in memory errors. Therefore this becomes several map jobs instead of one, and each consecutive
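A rough sketch of the pattern being described, with illustrative names (lookupChunks and applyLookup are hypothetical placeholders for the split lookup table and the per-record logic):

    var current = rdd
    for (chunk <- lookupChunks) {
      val bc = sc.broadcast(chunk)                         // ship one piece of the lookup table
      current = current.map(x => applyLookup(x, bc.value)).persist()
      current.count()                                      // force evaluation before the next broadcast
    }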

Re: How to do broadcast join in SparkSQL

2015-02-11 Thread Dima Zhiyanov
Thank you! The Hive solution seemed more like a workaround. I was wondering if native Spark SQL support for computing statistics for Parquet files would be available Dima Sent from my iPhone On Feb 11, 2015, at 3:34 PM, Ted Yu yuzhih...@gmail.com wrote: See earlier thread:

Re: high GC in the Kmeans algorithm

2015-02-11 Thread Sean Owen
Good, worth double-checking that's what you got. That's barely 1GB per task though. Why run 48 if you have 24 cores? On Wed, Feb 11, 2015 at 9:03 AM, lihu lihu...@gmail.com wrote: I give 50GB to the executor, so it seem that there is no reason the memory is not enough. On Wed, Feb 11, 2015

Re: high GC in the Kmeans algorithm

2015-02-11 Thread lihu
I just want to make the best use of CPU, and test the performance of spark if there is a lot of task in a single node. On Wed, Feb 11, 2015 at 5:29 PM, Sean Owen so...@cloudera.com wrote: Good, worth double-checking that's what you got. That's barely 1GB per task though. Why run 48 if you

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter
One additional comment I would make is that you should be careful with updates in Cassandra. It does support them, but large amounts of updates (i.e., changing existing keys) tend to cause fragmentation. If you are (mostly) adding new keys (e.g., new records in the time series) then Cassandra can

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter
I forgot to mention that if you do decide to use Cassandra I'd highly recommend jumping on the Cassandra mailing list; if we had taken some of the advice on that list, things would have been considerably smoother cheers On Wed, Feb 11, 2015 at 8:12 PM, Christian Betz

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Christian Betz
Hi, Regarding the Cassandra data model, there's an excellent post on the eBay tech blog: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/. There's also a slideshare for this somewhere. Happy hacking Chris From: Franc Carter

Re: Bug in ElasticSearch and Spark SQL: Using SQL to query out data, from JSON documents is totally wrong!

2015-02-11 Thread Costin Leau
Aris, if you encountered a bug, it's best to raise an issue with the es-hadoop/spark project, namely here [1]. When using SparkSQL the underlying data needs to be present - this is mentioned in the docs as well [2]. As for the order, that does look like a bug and shouldn't occur. Note the

apply function to all elements of a row matrix

2015-02-11 Thread Donbeo
Hi, I have a row matrix x scala x res3: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@63949747 and I would like to apply a function to each element of this matrix. I was looking for something like: x map (e => exp(-e*e)) How can I do
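RowMatrix has no element-wise map of its own, but one way to get the effect (a sketch) is to rebuild the matrix from its row RDD:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // apply e => exp(-e*e) to every entry by transforming each row vector
    val y = new RowMatrix(x.rows.map(v =>
      Vectors.dense(v.toArray.map(e => math.exp(-e * e)))))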

Re: SparkSQL + Tableau Connector

2015-02-11 Thread Todd Nist
Hi Arush, So yes, I want to create the tables through Spark SQL. I have placed the hive-site.xml file inside the $SPARK_HOME/conf directory; I thought that was all I should need to do to have the thriftserver use it. Perhaps my hive-site.xml is wrong; it currently looks like this:

Re: How to efficiently utilize all cores?

2015-02-11 Thread Harika
Hi Aplysia, Thanks for the reply. Could you be more specific in terms of what part of the document to look at? I have already seen it and tried a few of the relevant settings to no avail. -- View this message in context:

What do you think about the level of resource manager and file system?

2015-02-11 Thread Fangqi (Roy)
Hi guys~ Comparing these two architectures, why does BDAS put YARN and Mesos under HDFS? Do you have any special consideration, or is it just an easy way to express the AMPLab stack? Best regards!

Re: Writing to HDFS from spark Streaming

2015-02-11 Thread Sean Owen
That kinda dodges the problem by ignoring generic types. But it may be simpler than the 'real' solution, which is a bit ugly. (But first, to double check, are you importing the correct TextOutputFormat? there are two versions. You use .mapred. with the old API and .mapreduce. with the new API.)

high GC in the Kmeans algorithm

2015-02-11 Thread lihu
Hi, I run the kmeans (MLlib) in a cluster with 12 workers. Every worker owns 128G RAM and 24 cores. I run 48 tasks on one machine. The total data is just 40GB. When the dimension of the data set is about 10^7, for every task the duration is about 30s, but the cost for GC is about 20s. When I

OutOfMemoryError with random forest and small training dataset

2015-02-11 Thread poiuytrez
Hello guys, I am trying to run a Random Forest on 30MB of data. I have a cluster of 4 machines. Each machine has 106 GB of RAM and 16 cores. I am getting: 15/02/11 11:01:23 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-3] shutting down

Spark Streaming: Flume receiver with Kryo serialization

2015-02-11 Thread Antonio Jesus Navarro
Hi, I want to include Kryo serialization in a project if possible, and first I'm trying to run FlumeEventCount with Kryo. If I comment out the setAll method it runs correctly, but if I use the Kryo params it returns several errors. 15/02/11 11:42:16 ERROR SparkDeploySchedulerBackend: Asked to remove
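If the failures come from unregistered classes, a sketch of enabling Kryo and registering the Flume event class (assuming that is what ends up being serialized) would be:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("FlumeEventCount")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[org.apache.spark.streaming.flume.SparkFlumeEvent]))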

PySpark 1.2 Hadoop version mismatch

2015-02-11 Thread Michael Nazario
Hi Spark users, I seem to be having this consistent error, which I have been trying to reproduce and narrow down. I've been running a PySpark application on Spark 1.2 reading avro files from Hadoop. I was consistently seeing the following error: py4j.protocol.Py4JJavaError: An

Unable to query hive tables from spark

2015-02-11 Thread kundan kumar
I want to create/access the hive tables from spark. I have placed the hive-site.xml inside the spark/conf directory. Even so, it creates a local metastore in the directory where I run the spark shell and exits with an error. I am getting this error when I try to create a new hive table. Even

RE: PySpark 1.2 Hadoop version mismatch

2015-02-11 Thread Michael Nazario
I also forgot some other information. I have made this error go away by making my pyspark application use spark-1.1.1-bin-cdh4 for the driver, but communicate with a spark 1.2 master and worker. It's not a good workaround, so I would like to have the driver also be spark 1.2 Michael

Extract hour from Timestamp in Spark SQL

2015-02-11 Thread Wush Wu
Dear all, I am new to Spark SQL and have no experience of Hive. I tried to use the built-in Hive function to extract the hour from a timestamp in spark sql, but got: java.util.NoSuchElementException: key not found: hour How should I extract the hour from a timestamp? And I am very confused about
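For what it's worth, with a HiveContext the built-in Hive hour() UDF is available; a minimal sketch (the table and column names are illustrative):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // extract the hour-of-day from a timestamp column using Hive's built-in hour() UDF
    val hours = hiveContext.sql("SELECT hour(event_time) FROM events")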

Re: Strongly Typed SQL in Spark

2015-02-11 Thread Felix C
As far as I can tell from my tests, the language integrated query in spark isn't type safe, i.e. query.where('cost == foo) would compile and return nothing. If you want type safety, perhaps you want to map the SchemaRDD to an RDD of Product (your type, not scala.Product) --- Original Message --- From: jay

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Mike Trienis
Thanks everyone for your responses. I'll definitely think carefully about the data models, querying patterns and fragmentation side-effects. Cheers, Mike. On Wed, Feb 11, 2015 at 1:14 AM, Franc Carter franc.car...@rozettatech.com wrote: I forgot to mention that if you do decide to use

Re: feeding DataFrames into predictive algorithms

2015-02-11 Thread Michael Armbrust
It sounds like you probably want to do a standard Spark map, that results in a tuple with the structure you are looking for. You can then just assign names to turn it back into a dataframe. Assuming the first column is your label and the rest are features you can do something like this: val df

Re: feeding DataFrames into predictive algorithms

2015-02-11 Thread Patrick Wendell
I think there is a minor error here in that the first example needs a tail after the seq: df.map { row => (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double])) }.toDataFrame(label, features) On Wed, Feb 11, 2015 at 7:46 PM, Michael Armbrust mich...@databricks.com wrote: It sounds like

Re: Spark SQL - Point lookup optimisation in SchemaRDD?

2015-02-11 Thread nitin
I was able to resolve this use case (thanks Cheng Lian), where I wanted to launch executors on just the specific partition while also getting the batch pruning optimisations of Spark SQL, by doing the following: val query = sql(SELECT * FROM cachedTable WHERE key = 1) val plannedRDD =

Re: how to debug this kind of error, e.g. lost executor?

2015-02-11 Thread Praveen Garg
Try increasing the value of spark.yarn.executor.memoryOverhead. Its default value is 384 MB in Spark 1.1. This error generally comes when your process usage exceeds your max allocation. Use the following property to increase the memory overhead. From: Yifan LI

Re: Streaming scheduling delay

2015-02-11 Thread Tim Smith
Just read the thread "Are these numbers abnormal for spark streaming?" and I think I am seeing similar results - that is, increasing the window seems to be the trick here. I will have to monitor for a few hours/days before I can conclude (there are so many knobs/dials). On Wed, Feb 11, 2015 at

Re: PySpark 1.2 Hadoop version mismatch

2015-02-11 Thread Akhil Das
Did you have a look at http://spark.apache.org/docs/1.2.0/building-spark.html I think you can simply download the source and build for your hadoop version as: mvn -Dhadoop.version=2.0.0-mr1-cdh4.7.0 -DskipTests clean package Thanks Best Regards On Thu, Feb 12, 2015 at 11:45 AM, Michael

Streaming scheduling delay

2015-02-11 Thread Tim Smith
On Spark 1.2 (I have been seeing this behaviour since 1.0), I have a streaming app that consumes data from Kafka and writes it back to Kafka (different topic). My big problem has been Total Delay. While execution time is usually window size (in seconds), the total delay ranges from minutes to

Re: Can't access remote Hive table from spark

2015-02-11 Thread guxiaobo1982
Hi Zhan, Yes, I found there is an hdfs account, which is created by Ambari, but what's the password for this account, and how can I log in under this account? Can I just change the password for the hdfs account? Regards, -- Original -- From: Zhan