spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-22 Thread Sujee Maniyam
Hi all, I am unable to access s3n:// urls using sc.textFile(); I'm getting a 'no file system for scheme s3n://' error. Is this a bug or are some conf settings missing? See below for details: env variables: AWS_SECRET_ACCESS_KEY=set, AWS_ACCESS_KEY_ID=set; spark/RELEASE: Spark 1.3.1 (git revision 908a0bf)

RE: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-22 Thread Shuai Zheng
Below is my code to access s3n without a problem (only for 1.3.1; there is a bug in 1.3.0): Configuration hadoopConf = ctx.hadoopConfiguration(); hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
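A minimal Scala sketch of the same workaround (assuming sc is the SparkContext and the AWS credential env vars from the original post; the bucket and key names are placeholders; on Hadoop 2.6+ builds the hadoop-aws/jets3t jars also need to be on the classpath for NativeS3FileSystem to be found):

    val hadoopConf = sc.hadoopConfiguration
    // register the s3n filesystem implementation explicitly
    hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    // pass the AWS credentials through the Hadoop configuration
    hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
    val lines = sc.textFile("s3n://some-bucket/some-file")
    println(lines.count())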

Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-22 Thread Ted Yu
This thread from hadoop mailing list should give you some clue: http://search-hadoop.com/m/LgpTk2df7822 On Wed, Apr 22, 2015 at 9:45 AM, Sujee Maniyam su...@sujee.net wrote: Hi all I am unable to access s3n:// urls using sc.textFile().. getting 'no file system for scheme s3n://' error.

Can I index a column in parquet file to make it join faster

2015-04-22 Thread Wang, Ningjun (LNG-NPV)
I have two RDDs, each saved in a parquet file. I need to join these two RDDs by the id column. Can I create an index on the id column so they can join faster? Here is the code: case class Example(val id: String, val category: String) case class DocVector(val id: String, val vector: Vector) val

Re: Convert DStream to DataFrame

2015-04-22 Thread Sergio Jiménez Barrio
I tried the solution from the guide, but I exceeded the size of case class Row: 2015-04-22 15:22 GMT+02:00 Tathagata Das tathagata.das1...@gmail.com: Did you checkout the latest streaming programming guide?

Re: Clustering algorithms in Spark

2015-04-22 Thread Jeetendra Gangele
Does anybody have any thoughts on this? On 21 April 2015 at 20:57, Jeetendra Gangele gangele...@gmail.com wrote: The problem with k-means is that we have to define the number of clusters, which I don't want in this case. So I'm thinking of something like hierarchical clustering. Any ideas and suggestions?

Re: Convert DStream to DataFrame

2015-04-22 Thread Tathagata Das
Did you checkout the latest streaming programming guide? http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations You also need to be aware that to convert JSON RDDs to a DataFrame, sqlContext has to make a pass over the data to learn the schema. This will

Re: Building Spark : Building just one module.

2015-04-22 Thread Iulian Dragoș
One way is to use export SPARK_PREPEND_CLASSES=true. This will instruct the launcher to prepend the target directories for each project to the spark assembly. I've had mixed experiences with it lately, but in principle that's the only way I know. On Wed, Apr 22, 2015 at 3:42 PM, zia_kayani
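A sketch of the workflow being described (the module name and flags are assumptions; adjust to the module you are actually changing):

    # let spark-shell / spark-submit pick up freshly compiled classes instead of the assembly
    export SPARK_PREPEND_CLASSES=true
    # rebuild only the module you touched (plus its in-repo dependencies via -am)
    build/mvn -pl sql/core -am -DskipTests package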

Re: Auto Starting a Spark Job on Cluster Starup

2015-04-22 Thread Ted Yu
This thread seems related: http://search-hadoop.com/m/JW1q51W02V Cheers On Wed, Apr 22, 2015 at 6:09 AM, James King jakwebin...@gmail.com wrote: What's the best way to start up a Spark job as part of starting up the Spark cluster? I have a single uber jar for my job and want to make the

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Jeff Nadler
You can run multiple Spark clusters against one ZK cluster. Just use this config to set independent ZK roots for each cluster: spark.deploy.zookeeper.dir, the directory in ZooKeeper to store recovery state (default: /spark). -Jeff From: Sean Owen so...@cloudera.com To: Akhil Das
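A hedged config sketch for two standalone clusters sharing one ZooKeeper ensemble (hostnames and paths are placeholders; in 1.3 these properties are typically passed to the masters via SPARK_DAEMON_JAVA_OPTS or spark-defaults.conf):

    # cluster A masters
    spark.deploy.recoveryMode    ZOOKEEPER
    spark.deploy.zookeeper.url   zk1:2181,zk2:2181,zk3:2181
    spark.deploy.zookeeper.dir   /spark-cluster-a
    # cluster B masters: same url, but a different dir, e.g. /spark-cluster-b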

Re: Spark Performance on Yarn

2015-04-22 Thread Ted Yu
In master branch, overhead is now 10%. That would be 500 MB FYI On Apr 22, 2015, at 8:26 AM, nsalian neeleshssal...@gmail.com wrote: +1 to executor-memory to 5g. Do check the overhead space for both the driver and the executor as per Wilfred's suggestion. Typically, 384 MB should
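For reference, the overhead can also be set explicitly when submitting to YARN; a sketch (the values are placeholders, and the overhead is given in MB):

    spark-submit --master yarn \
      --executor-memory 5g \
      --conf spark.yarn.executor.memoryOverhead=512 \
      ...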

Master -chatter - Worker

2015-04-22 Thread James King
Is there a good resource that covers what kind of chatter (communication) goes on between the driver, master and worker processes? Thanks

the indices of SparseVector must be ordered while computing SVD

2015-04-22 Thread yaochunnan
Hi all, I am using Spark 1.3.1 to write a Spectral Clustering algorithm. This really confused me today. At first I thought my implementation was wrong. It turns out it's an issue in MLlib. Fortunately, I've figured it out. I suggest adding a hint to the MLlib user documentation (as far as I know,

Re: Parquet Hive table become very slow on 1.3?

2015-04-22 Thread Yin Huai
Xudong and Rex, Can you try 1.3.1? With PR 5339 http://github.com/apache/spark/pull/5339 , after we get a Hive Parquet table from the metastore and convert it to our native parquet code path, we will cache the converted relation. For now, the first access to that Hive Parquet table reads all of the footers

Re: Spark Performance on Yarn

2015-04-22 Thread nsalian
+1 to executor-memory to 5g. Do check the overhead space for both the driver and the executor as per Wilfred's suggestion. Typically, 384 MB should suffice. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p22610.html Sent

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Akhil Das
nice, thanks for the information. Thanks Best Regards On Wed, Apr 22, 2015 at 8:53 PM, Jeff Nadler jnad...@srcginc.com wrote: You can run multiple Spark clusters against one ZK cluster. Just use this config to set independent ZK roots for each cluster: spark.deploy.zookeeper.dir

Re: Spark Performance on Yarn

2015-04-22 Thread Neelesh Salian
Does it still hit the memory limit for the container? An expensive transformation? On Wed, Apr 22, 2015 at 8:45 AM, Ted Yu yuzhih...@gmail.com wrote: In master branch, overhead is now 10%. That would be 500 MB FYI On Apr 22, 2015, at 8:26 AM, nsalian neeleshssal...@gmail.com wrote:

RDD.filter vs. RDD.join--advice please

2015-04-22 Thread hokiegeek2
Hi Everyone, I have two options for filtering the RDD resulting from the Graph.vertices method, as illustrated with the following pseudo code: 1. Filter: val vertexSet = Set(vertexOne, vertexTwo...); val filteredVertices = Graph.vertices.filter(x => vertexSet.contains(x._2.vertexName)) 2. Join

Trouble working with Spark-CSV package (error: object databricks is not a member of package com)

2015-04-22 Thread Mohammed Omer
Afternoon all, I'm working with Scala 2.11.6, and Spark 1.3.1 built from source via: `mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package` The error is encountered when running spark shell via: `spark-shell --packages com.databricks:spark-csv_2.11:1.0.3` The full trace of the

Re: Map Question

2015-04-22 Thread Tathagata Das
Is the mylist present on every executor? If not, then you have to pass it on. And broadcasts are the best way to pass them on. But note that once broadcasted it will be immutable at the executors, and if you update the list at the driver, you will have to broadcast it again. TD On Wed, Apr 22, 2015
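A minimal Scala sketch of the pattern being described (mylist, myrdd and myfunc are placeholder names taken from this thread):

    // ship the list to the executors once
    val listBc = sc.broadcast(mylist)
    // read it through .value inside the closure
    val result = myrdd.map(x => myfunc(x, listBc.value))
    // if the list changes at the driver, create a new broadcast and use that one from then on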

Re: regarding ZipWithIndex

2015-04-22 Thread Jeetendra Gangele
Sure, thanks. If you can guide me on how to do this, it would be a great help. On 17 April 2015 at 22:05, Ted Yu yuzhih...@gmail.com wrote: I have some assignments on hand at the moment. Will try to come up with sample code after I clear the assignments. FYI On Thu, Apr 16, 2015 at 2:00 PM,

Spark streaming action running the same work in parallel

2015-04-22 Thread ColinMc
Hi, I'm running a unit test that keeps failing with the code I wrote in Spark. Here are the output logs from the test I ran, which gets the customers from incoming events in the JSON called QueueEvent; I am trying to convert the incoming events for each customer to an alert. From

Re: Not able run multiple tasks in parallel, spark streaming

2015-04-22 Thread Tathagata Das
Furthermore, just to explain, doing arr.par.foreach does not help because it is not really running anything, it is only doing the setup of the computation. Doing the setup in parallel does not mean that the jobs will be done concurrently. Also, from your code it seems like your pairs of dstreams don't

Re: Map Question

2015-04-22 Thread Vadim Bichutskiy
Can I use broadcast vars in local mode? On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das t...@databricks.com wrote: Yep. Not efficient. Pretty bad actually. That's why broadcast variables were introduced right at the very beginning of Spark. On Wed, Apr 22, 2015 at 10:58 AM, Vadim

Re: Map Question

2015-04-22 Thread Tathagata Das
Absolutely. The same code would work for local as well as distributed mode! On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: Can I use broadcast vars in local mode? On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das t...@databricks.com wrote: Yep. Not

Re: Map Question

2015-04-22 Thread Tathagata Das
Yep. Not efficient. Pretty bad actually. That's why broadcast variables were introduced right at the very beginning of Spark. On Wed, Apr 22, 2015 at 10:58 AM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: Thanks TD. I was looking into broadcast variables. Right now I am running it

Re: ElasticSearch for Spark times out

2015-04-22 Thread Jeetendra Gangele
Will you be able to paste the code? On 23 April 2015 at 00:19, Adrian Mocanu amoc...@verticalscope.com wrote: Hi I use the ElasticSearch package for Spark and very often it times out reading data from ES into an RDD. How can I keep the connection alive (why doesn't it? Bug?) Here's

RE: ElasticSearch for Spark times out

2015-04-22 Thread Adrian Mocanu
Hi The gist of it is this: I have data indexed into ES. Each index stores monthly data and the query will get data for some date range (across several ES indexes, or within 1 if the date range is within 1 month). Then I merge these RDDs into an uberRdd, perform some operations and then print the

Re: ElasticSearch for Spark times out

2015-04-22 Thread Jeetendra Gangele
Basically a read timeout means that no data arrived within the specified receive timeout period. A few things I would suggest: 1. Is your ES cluster up and running? 2. If 1 is yes, then reduce the size of the index, make it a few KB, and then test. On 23 April 2015 at 00:19, Adrian Mocanu

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-22 Thread Tathagata Das
It could very well be that your executor memory is not enough to store the state RDDs AND operate on the data. 1G per executor is quite low. Definitely give more memory. And have you tried increasing the number of partitions (specify the number of partitions in updateStateByKey)? On Wed, Apr 22,
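A sketch of that suggestion (the update function and the partition count are placeholders; updateStateByKey has an overload that takes an explicit number of partitions):

    // spread the state RDD over more, smaller partitions
    val stateDStream = pairDStream.updateStateByKey(updateFunc, numPartitions = 100)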

ElasticSearch for Spark times out

2015-04-22 Thread Adrian Mocanu
Hi I use the ElasticSearch package for Spark and very often it times out reading data from ES into an RDD. How can I keep the connection alive (why doesn't it? Bug?) Here's the exception I get: org.elasticsearch.hadoop.serialization.EsHadoopSerializationException:

Re: Efficient saveAsTextFile by key, directory for each key?

2015-04-22 Thread Arun Luthra
I ended up post-processing the result in Hive with a dynamic partition insert query to get a table partitioned by the key. Looking further, it seems that 'dynamic partition' insert is in Spark SQL and working well as of Spark SQL version 1.2.0: https://issues.apache.org/jira/browse/SPARK-3007 On
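A sketch of such a dynamic-partition insert through a HiveContext (table and column names are placeholders):

    hc.sql("SET hive.exec.dynamic.partition=true")
    hc.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    hc.sql("INSERT OVERWRITE TABLE output_by_key PARTITION (key) SELECT value, key FROM staging")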

Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-22 Thread Sujee Maniyam
Thanks all... btw, s3n load works without any issues with spark-1.3.1-built-for-hadoop-2.4. I tried this on 1.3.1-hadoop26: sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem") val f = sc.textFile("s3n://bucket/file") f.count No, it can't find the

Spark SQL: SchemaRDD, DataFrame. Multi-value, Nested attributes

2015-04-22 Thread Eugene Morozov
Hi! I'm trying to query a dataset that reads data from csv and provides SQL on top of it. The problem I have is that I have a hierarchy of objects that I need to represent as a table so that users can use SQL to query it and do some aggregations. I do have multi-value attributes (that in csv

RE: Scheduling across applications - Need suggestion

2015-04-22 Thread yana
Yes. Fair scheduler only helps concurrency within an application. With multiple shells you'd either need something like Yarn/Mesos or careful math on resources, as you said. Sent on the new Sprint Network from my Samsung Galaxy S®4. Original message From: Arun Patel

RE: ElasticSearch for Spark times out

2015-04-22 Thread Adrian Mocanu
Hi Thanks for the help. My ES is up. Out of curiosity, do you know what the timeout value is? There are probably other things happening to cause the timeout; I don't think my ES is that slow but it's possible that ES is taking too long to find the data. What I see happening is that it uses

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-22 Thread Jean-Pascal Billaud
I have now a fair understanding of the situation after looking at javap output. So as a reminder: dstream.map(tuple => { val w = StreamState.fetch[K,W](state.prefixKey, tuple._1) (tuple._1, (tuple._2, w)) }) And StreamState being a very simple standalone object: object StreamState { def

Re: Map Question

2015-04-22 Thread Tathagata Das
Can you give the full code, especially myfunc? On Wed, Apr 22, 2015 at 2:20 PM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: Here's what I did: print 'BROADCASTING...' broadcastVar = sc.broadcast(mylist) print broadcastVar print broadcastVar.value print 'FINISHED BROADCASTING...'

Re: RE: ElasticSearch for Spark times out

2015-04-22 Thread Otis Gospodnetic
Hi, If you get ES response back in 1-5 seconds that's pretty slow. Are these ES aggregation queries? Costin may be right about GC possibly causing timeouts. SPM http://sematext.com/spm/ can give you all Spark and all key Elasticsearch metrics, including various JVM metrics. If the problem is

Re: StackOverflow Error when run ALS with 100 iterations

2015-04-22 Thread amghost
Hi, could you please explain how to checkpoint the training set RDD, since everything is done inside the ALS.train method. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/StackOverflow-Error-when-run-ALS-with-100-iterations-tp4296p22619.html Sent from the Apache Spark User
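One hedged sketch: when a checkpoint directory is set on the SparkContext, the ALS implementation can periodically checkpoint its internal RDDs, which keeps the lineage from growing unboundedly across many iterations (whether this happens automatically depends on the Spark version; the path and parameter values below are placeholders):

    import org.apache.spark.mllib.recommendation.ALS
    sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")
    val model = ALS.train(ratings, rank = 10, iterations = 100, lambda = 0.01)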

How to access HBase on Spark SQL

2015-04-22 Thread doovsaid
I notice that Databricks provides an external data source API for Spark SQL, but I can't find any reference documents that explain how to access HBase based on it directly. Does anyone know? Or please give me some related links. Thanks. ZhangYi (张逸) BigEye website:

Re: Parquet Hive table become very slow on 1.3?

2015-04-22 Thread Rex Xiong
Yin, Thanks for your reply. We already applied this PR to our 1.3.0. As Xudong mentioned, we have thousands of parquet files; it's very slow on the first read, and another app will add more files and refresh the table regularly. Cheng Lian's PR 5334 seems like it can resolve this issue; it will skip reading all

Re: Start ThriftServer Error

2015-04-22 Thread Denny Lee
You may need to specify the hive port itself. For example, my own Thrift start command is in the form: ./sbin/start-thriftserver.sh --master spark://$myserver:7077 --driver-class-path $CLASSPATH --hiveconf hive.server2.thrift.bind.host $myserver --hiveconf hive.server2.thrift.port 1 HTH!

Re: RE: ElasticSearch for Spark times out

2015-04-22 Thread Costin Leau
Hi, First off, for Elasticsearch questions it is worth pinging the Elastic mailing list as that is more closely monitored than this one. Back to your question, Jeetendra is right that the exception indicates no data is flowing back to the es-connector and Spark. The default is 1m [1] which should be

Re: Hive table creation - possible bug in Spark 1.3?

2015-04-22 Thread Michael Armbrust
Sorry for the confusion. We should be more clear about the semantics in the documentation. (PRs welcome :) ) .saveAsTable does not create a Hive table, but instead creates a Spark Data Source table. Here the metadata is persisted into Hive, but Hive cannot read the tables (as this API support

spark-ec2 s3a filesystem support and hadoop versions

2015-04-22 Thread Daniel Mahler
I would like to easily launch a cluster that supports s3a file systems. If I launch a cluster with `spark-ec2 --hadoop-major-version=2`, what determines the minor version of hadoop? Does it depend on the spark version being launched? Are there other allowed values for --hadoop-major-version

why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-22 Thread Hao Ren
Hi, Just a quick question regarding the source code of groupByKey: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L453 In the end, it casts CompactBuffer to Iterable. But why? Any advantage? Thank you. Hao. -- View this message

setting cost in linear SVM [Python]

2015-04-22 Thread Pagliari, Roberto
Is there a way to set the cost value C when using linear SVM?
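MLlib's linear SVM does not expose a libSVM-style C directly; regularization is controlled through regParam (roughly, a larger C corresponds to a smaller regParam, with the exact mapping depending on data scaling). A hedged sketch, assuming training is an RDD[LabeledPoint] and the values are placeholders:

    import org.apache.spark.mllib.classification.SVMWithSGD
    // train(input, numIterations, stepSize, regParam)
    val model = SVMWithSGD.train(training, 100, 1.0, 0.01)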

beeline that comes with spark 1.3.0 doesn't work with --hiveconf or --hivevar which substitutes variables at hive scripts.

2015-04-22 Thread ogoh
Hello, I am using Spark 1.3 for the Spark SQL (Hive) Thrift Server and Beeline. Beeline doesn't work with --hiveconf or --hivevar, which substitute variables in Hive scripts. I found the following jiras saying that Hive 0.13 resolved that issue. I wonder if this is a well-known issue?

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-22 Thread Tathagata Das
Vaguely makes sense. :) Wow that's an interesting corner case. On Wed, Apr 22, 2015 at 1:57 PM, Jean-Pascal Billaud j...@tellapart.com wrote: I have now a fair understanding of the situation after looking at javap output. So as a reminder: dstream.map(tuple = { val w =

Re: RE: ElasticSearch for Spark times out

2015-04-22 Thread Nick Pentreath
Is your ES cluster reachable from your Spark cluster via network / firewall? Can you run the same query from the spark master and slave nodes via curl / one of the other clients? Seems odd that GC issues would be a problem from the scan but not when running query from a browser plugin...

Loading lots of .parquet files in Spark 1.3.1 (Hadoop 2.4)

2015-04-22 Thread cosmincatalin
I am trying to read a few hundred .parquet files from S3 into an EMR cluster. The .parquet files are structured by date and have _common_metadata in each of the folders (as well as _metadata). The sqlContext.parquetFile operation takes a very long time, opening for reading each of the

LDA code little error @Xiangrui Meng

2015-04-22 Thread buring
Hi: there is a little error in the source code of LDA.scala at line 180, as follows: def setBeta(beta: Double): this.type = setBeta(beta) which causes a java.lang.StackOverflowError. It's easy to see there is an error. -- View this message in context:
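The recursion is the bug: setBeta calls itself instead of the underlying setter. A hedged guess at the intended definition, assuming beta is just an alias for topicConcentration (as the corresponding getter suggests):

    def setBeta(beta: Double): this.type = setTopicConcentration(beta)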

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-22 Thread Xiangrui Meng
This is the size of the serialized task closure. Is stage 246 part of ALS iterations, or something before or after it? -Xiangrui On Tue, Apr 21, 2015 at 10:36 AM, Christian S. Perone christian.per...@gmail.com wrote: Hi Sean, thanks for the answer. I tried to call repartition() on the input

How to debug Spark on Yarn?

2015-04-22 Thread ๏̯͡๏
I submit a spark app to YARN and I get these messages 15/04/22 22:45:04 INFO yarn.Client: Application report for application_1429087638744_101363 (state: RUNNING) 15/04/22 22:45:04 INFO yarn.Client: Application report for application_1429087638744_101363 (state: RUNNING). ... 1) I can go to
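One common starting point (a generic sketch, not from this thread): pull the aggregated container logs for the application id shown in the report lines, and/or follow the ApplicationMaster and executor log links in the YARN ResourceManager UI:

    yarn logs -applicationId application_1429087638744_101363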

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-22 Thread Xiangrui Meng
The patch was merged and it will be included in 1.3.2 and 1.4.0. Thanks for reporting the bug! -Xiangrui On Tue, Apr 21, 2015 at 2:51 PM, ayan guha guha.a...@gmail.com wrote: Thank you all. On 22 Apr 2015 04:29, Xiangrui Meng men...@gmail.com wrote: SchemaRDD subclasses RDD in 1.2, but

Re: Problem with using Spark ML

2015-04-22 Thread Xiangrui Meng
Please try reducing the step size. The native BLAS library is not required. -Xiangrui On Tue, Apr 21, 2015 at 5:15 AM, Staffan staffan.arvids...@gmail.com wrote: Hi, I've written an application that performs some machine learning on some data. I've validated that the data _should_ give a good

Re: [MLlib] fail to run word2vec

2015-04-22 Thread Xiangrui Meng
We store the vectors on the driver node. So it is hard to handle a really large vocabulary. You can use setMinCount to filter out infrequent words to reduce the model size. -Xiangrui On Wed, Apr 22, 2015 at 12:32 AM, gm yu husty...@gmail.com wrote: When using MLlib Word2Vec, I get the following
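A hedged sketch of that suggestion, assuming corpus is an RDD[Seq[String]] and the threshold/vector size are placeholder values:

    import org.apache.spark.mllib.feature.Word2Vec
    // drop words seen fewer than 25 times so the vocabulary (held on the driver) stays small
    val model = new Word2Vec().setMinCount(25).setVectorSize(100).fit(corpus)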

Re: the indices of SparseVector must be ordered while computing SVD

2015-04-22 Thread Xiangrui Meng
Having ordered indices is a contract of SparseVector: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector. We do not verify it for performance. -Xiangrui On Wed, Apr 22, 2015 at 8:26 AM, yaochunnan yaochun...@gmail.com wrote: Hi all, I am using
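A minimal sketch of the contract (the sizes and values are arbitrary):

    import org.apache.spark.mllib.linalg.Vectors
    val ok  = Vectors.sparse(5, Array(0, 2, 4), Array(1.0, 2.0, 3.0)) // indices ordered: fine
    val bad = Vectors.sparse(5, Array(2, 0, 4), Array(2.0, 1.0, 3.0)) // unordered: violates the contract, not verified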

Re: LDA code little error @Xiangrui Meng

2015-04-22 Thread Xiangrui Meng
Thanks! That's a bug .. -Xiangrui On Wed, Apr 22, 2015 at 9:34 PM, buring qyqb...@gmail.com wrote: Hi: there is a little error in source code LDA.scala at line 180, as follows: def setBeta(beta: Double): this.type = setBeta(beta) which cause java.lang.StackOverflowError. It's easy

Re: StandardScaler failing with OOM errors in PySpark

2015-04-22 Thread Xiangrui Meng
What is the feature dimension? Did you set the driver memory? -Xiangrui On Tue, Apr 21, 2015 at 6:59 AM, rok rokros...@gmail.com wrote: I'm trying to use the StandardScaler in pyspark on a relatively small (a few hundred Mb) dataset of sparse vectors with 800k features. The fit method of

Re: RDD.filter vs. RDD.join--advice please

2015-04-22 Thread dsgriffin
Test it out, but I would be willing to bet the join is going to be a good deal faster. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-filter-vs-RDD-join-advice-please-tp22612p22614.html Sent from the Apache Spark User List mailing list archive at

Re: Scheduling across applications - Need suggestion

2015-04-22 Thread Lan Jiang
The YARN capacity scheduler supports hierarchical queues, to which you can assign cluster resources as percentages. Your spark applications/shells can be submitted to different queues. Mesos supports fine-grained mode, which allows the machines/cores used by each executor to ramp up and down. Lan On Wed, Apr 22,
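A sketch of queue-based submission (the queue names are placeholders; the queues themselves are defined in YARN's capacity-scheduler.xml):

    spark-submit --master yarn --queue analytics ...
    spark-shell  --master yarn --queue adhoc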

Re: Map Question

2015-04-22 Thread Vadim Bichutskiy
Here's what I did: print 'BROADCASTING...' broadcastVar = sc.broadcast(mylist) print broadcastVar print broadcastVar.value print 'FINISHED BROADCASTING...' The above works fine, but when I call myrdd.map(myfunc) I get NameError: global name 'broadcastVar' is not defined. The myfunc function

Re: SparkSQL performance

2015-04-22 Thread Michael Armbrust
https://github.com/databricks/spark-avro On Tue, Apr 21, 2015 at 3:09 PM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Thanks Michael! I have tried applying my schema programatically but I didn't get any improvement on performance :( Could you point me to some code examples

RE: Trouble working with Spark-CSV package (error: object databricks is not a member of package com)

2015-04-22 Thread yana
You can try pulling the jar with wget and using it with --jars with Spark shell. I used 1.0.3 with Spark 1.3.0 but with a different version of scala. From the stack trace it looks like Spark shell is just not seeing the csv jar... Sent on the new Sprint Network from my Samsung Galaxy S®4.
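A sketch of that suggestion (the Maven Central URL/coordinates are an assumption; pick the artifact matching your Scala version, and note that spark-csv's own dependencies such as commons-csv may also be needed):

    wget http://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.0.3/spark-csv_2.11-1.0.3.jar
    spark-shell --jars spark-csv_2.11-1.0.3.jar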

Re: Join on DataFrames from the same source (Pyspark)

2015-04-22 Thread Karlson
DataFrames do not have the attributes 'alias' or 'as' in the Python API. On 2015-04-21 20:41, Michael Armbrust wrote: This is https://issues.apache.org/jira/browse/SPARK-6231 Unfortunately this is pretty hard to fix as its hard for us to differentiate these without aliases. However you can

Building Spark : Adding new DataType in Catalyst

2015-04-22 Thread zia_kayani
Hi, I am working on adding Geometry, i.e. a new DataType, into Spark Catalyst so that a Row can hold that object as well. I've made progress, but it is time-consuming as I have to compile the whole Spark project, otherwise the changes aren't visible. I've tried to just build the Spark SQL and Catalyst modules

[MLlib] fail to run word2vec

2015-04-22 Thread gm yu
When using MLlib Word2Vec, I get the following error: allocating large array--thread_id[0x7ff2741ca000]--thread_name[Driver]--array_size[1146093680 bytes]--array_length[1146093656 elememts] prio=10 tid=0x7ff2741ca000 nid=0x1405f runnable at

Re: sparksql - HiveConf not found during task deserialization

2015-04-22 Thread Akhil Das
I see. Now try a bit of a tricky approach: add the Hive jar to the SPARK_CLASSPATH (in the conf/spark-env.sh file on all machines) and make sure that the jar is available on all the machines in the cluster at the same path. Thanks Best Regards On Wed, Apr 22, 2015 at 11:24 AM, Manku Timma
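A sketch of that suggestion (the jar path is a placeholder; whichever path you use must exist on every node):

    # conf/spark-env.sh on every machine in the cluster
    export SPARK_CLASSPATH=$SPARK_CLASSPATH:/opt/hive/lib/hive-exec.jar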

Re: How does GraphX store the routing table?

2015-04-22 Thread MUHAMMAD AAMIR
Hi Ankur, Thanks for the answer. However, I still have the following queries. On Wed, Apr 22, 2015 at 8:39 AM, Ankur Dave ankurd...@gmail.com wrote: On Tue, Apr 21, 2015 at 10:39 AM, mas mas.ha...@gmail.com wrote: How does GraphX store the routing table? Is it stored on the master node or

Re: Not able run multiple tasks in parallel, spark streaming

2015-04-22 Thread Akhil Das
You can enable this flag to run multiple jobs concurrently. It might not be production ready, but you can give it a try: sc.set("spark.streaming.concurrentJobs", "2") Refer to TD's answer here
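A hedged sketch of setting it when building the conf (the flag is an undocumented/experimental SparkConf property, so test before relying on it; the app name is a placeholder):

    import org.apache.spark.SparkConf
    val conf = new SparkConf().setAppName("my-streaming-app").set("spark.streaming.concurrentJobs", "2")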

Hive table creation - possible bug in Spark 1.3?

2015-04-22 Thread Ophir Cohen
I wrote a few mails here regarding this issue. After further investigation I think there is a bug in Spark 1.3 in saving Hive tables. (hc is a HiveContext) 1. Verify the needed configuration exists: scala> hc.sql("set hive.exec.compress.output").collect res4: Array[org.apache.spark.sql.Row] =

Re: Addition of new Metrics for killed executors.

2015-04-22 Thread twinkle sachdeva
Hi, Looks interesting. It would be good to know the reason for not showing these stats in the UI. As per the description by Patrick W in https://spark-project.atlassian.net/browse/SPARK-999, it does not mention any exception w.r.t. failed tasks/executors. Can

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Steve Loughran
The key thing would be to use different ZK paths for each cluster. You shouldn't need more than 2 ZK quorums even for a large (few thousand node) Hadoop cluster: one for the HA bits of the infrastructure (HDFS, YARN) and one for the applications to abuse. It's easy for apps using ZK to stick

Scheduling across applications - Need suggestion

2015-04-22 Thread Arun Patel
I believe we can use properties like --executor-memory and --total-executor-cores to configure the resources allocated for each application. But in a multi-user environment, shells and applications are being submitted by multiple users at the same time. All users are requesting resources with

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-22 Thread Sourav Chandra
Anyone? On Wed, Apr 22, 2015 at 12:29 PM, Sourav Chandra sourav.chan...@livestream.com wrote: Hi Olivier, the update function is as below: val updateFunc = (values: Seq[IConcurrentUsers], state: Option[(Long, Long)]) => { val previousCount = state.getOrElse((0L, 0L))._2

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Sean Owen
Not that I've tried it, but why couldn't you use one ZK server? I don't see a reason. On Wed, Apr 22, 2015 at 7:40 AM, Akhil Das ak...@sigmoidanalytics.com wrote: It isn't mentioned anywhere in the doc, but you will probably need a separate ZK for each of your HA clusters. Thanks Best Regards

How to write Hive's map(key, value, ...) in Spark SQL DSL

2015-04-22 Thread Jianshi Huang
Hi, I want to write this in Spark SQL DSL: select map('c1', c1, 'c2', c2) as m from table Is there a way? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
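One possible workaround (a sketch, assuming df is a DataFrame created from a HiveContext so that Hive's map() function is available; the column names follow the question):

    val m = df.selectExpr("map('c1', c1, 'c2', c2) as m")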

Error in creating spark RDD

2015-04-22 Thread madhvi
Hi, I am creating a Spark RDD from Accumulo like: JavaPairRDD<Key, Value> accumuloRDD = sc.newAPIHadoopRDD(accumuloJob.getConfiguration(), AccumuloInputFormat.class, Key.class, Value.class); But I am getting the following error and it does not compile: Bound mismatch: The

Re: Shuffle question

2015-04-22 Thread Marius Danciu
Anyone? On Tue, Apr 21, 2015 at 3:38 PM Marius Danciu marius.dan...@gmail.com wrote: Hello anyone, I have a question regarding the sort shuffle. Roughly I'm doing something like: rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2) The problem is that in f2 I don't see the