Re: Building Spark behind a proxy

2015-01-29 Thread Soumya Simanta
I can do a wget http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom and get the file successfully from a shell. On Thu, Jan 29, 2015 at 11:51 AM, Boromir Widas vcsub...@gmail.com wrote: At least part of it is due to connection refused; can you check if curl can reach the

Re: Building Spark behind a proxy

2015-01-29 Thread Soumya Simanta
On Thu, Jan 29, 2015 at 11:05 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Does the error change when you build with and without the build options? What do you mean by build options? I'm just doing ./sbt/sbt assembly from $SPARK_HOME. Did you try using maven and doing the proxy settings

Re: Building Spark behind a proxy

2015-01-29 Thread Boromir Widas
At least part of it is due to connection refused; can you check if curl can reach the URL with the proxies? - [FATAL] Non-resolvable parent POM: Could not transfer artifact org.apache:apache:pom:14 from/to central (http://repo.maven.apache.org/maven2): Error transferring file: Connection refused

RE: Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread Wang, Ningjun (LNG-NPV)
Install virtual box which runs Linux? That does not help us. We have a business reason to run it on the Windows operating system, e.g. Windows 2008 R2. If anybody has done that, please give some advice on what version of Spark to use, which version of Hadoop you built Spark against, etc. Note that we

Re: Spark and S3 server side encryption

2015-01-29 Thread Ted Yu
fs.s3a.server-side-encryption-algorithm is honored by s3a support in hadoop 2.6.0+ as well. Cheers On Thu, Jan 29, 2015 at 6:51 AM, Danny kont...@dannylinden.de wrote: On Spark 1.2.0 you have the s3a library to work with S3. And there is a config param named

schemaRDD.saveAsParquetFile creates large number of small parquet files ...

2015-01-29 Thread Manoj Samel
Spark 1.2 on Hadoop 2.3. Read one big csv file, create a schemaRDD on it and saveAsParquetFile. It creates a large number of small (~1MB) parquet part-x- files. Any way to control this so that a smaller number of large files is created? Thanks,

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-29 Thread Emre Sevinc
Charles, Thank you very much for another suggestion. Unfortunately I couldn't make it work that way either. So I downgraded my SolrJ library from 4.10.3 to 4.0.0 [1]. Maybe using Relocating Classes [2] feature of Maven could handle this issue, but I did not want to complicate my pom.xml further,

Running a custom setup task on all workers

2015-01-29 Thread Noam Barcay
Hello fellow Sparkians, Is there some preferred way to have *some given set-up task run on all workers?* The task at hand isn't a computational task, but rather some initial setup that I want to run for its *side-effects*. This could be to set up some custom logging settings, or metrics.

Re: SQL query over (Long, JSON string) tuples

2015-01-29 Thread Ayoub
Hello, SQLContext and HiveContext have a jsonRDD method which accepts an RDD[String], where the string is a JSON string, and returns a SchemaRDD; it extends RDD[Row], which is the type you want. Afterwards you should be able to do a join to keep your tuple. Best, Ayoub. 2015-01-29 10:12 GMT+01:00
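A minimal sketch of the jsonRDD call described above, against a Spark 1.2 SQLContext; `sc`, `data` and the `msg` column are hypothetical stand-ins, and re-attaching the timestamp still needs either a join key or the id embedded in the JSON string itself (as suggested elsewhere in this thread):

    import org.apache.spark.sql.SQLContext

    // data: RDD[(Long, String)] of (timestamp, JSON string) -- hypothetical input
    val sqlContext = new SQLContext(sc)

    // jsonRDD infers a schema from the JSON strings and returns a SchemaRDD (an RDD[Row])
    val events = sqlContext.jsonRDD(data.map(_._2))
    events.registerTempTable("events")
    val result = sqlContext.sql("SELECT msg FROM events")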

Re: SQL query over (Long, JSON string) tuples

2015-01-29 Thread Tobias Pfeiffer
Hi Ayoub, thanks for your mail! On Thu, Jan 29, 2015 at 6:23 PM, Ayoub benali.ayoub.i...@gmail.com wrote: SQLContext and HiveContext have a jsonRDD method which accepts an RDD[String], where the string is a JSON string, and returns a SchemaRDD; it extends RDD[Row], which is the type you want. After

Re: Dependency unresolved hadoop-yarn-common 1.0.4 when running quickstart example

2015-01-29 Thread Arush Kharbanda
Hi Sarwar, For a quick fix you can exclude dependencies for yarn (you won't be needing them if you are running locally), e.g. libraryDependencies += "log4j" % "log4j" % "1.2.15" exclude("javax.jms", "jms") You can also analyze your dependencies using this plugin
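Applied to the yarn case that prompted the question, the same exclude pattern in build.sbt might look like this (the coordinates and version are illustrative, not taken from Sarwar's build; check the dependency tree to find the artifact that actually drags the wrong hadoop-yarn-common in):

    // build.sbt -- hypothetical example of excluding hadoop-yarn-common from a dependency
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" exclude("org.apache.hadoop", "hadoop-yarn-common")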

SQL query over (Long, JSON string) tuples

2015-01-29 Thread Tobias Pfeiffer
Hi, I have data as RDD[(Long, String)], where the Long is a timestamp and the String is a JSON-encoded string. I want to infer the schema of the JSON and then do a SQL statement on the data (no aggregates, just column selection and UDF application), but still have the timestamp associated with

Re: Data are partial to a specific partition after sort

2015-01-29 Thread Sean Owen
(By the way, you can use wordRDD.countByValue instead of the map and reduceByKey. It won't make a difference to your issue but is more compact.) As you say, the problem is the very limited range of keys (word lengths). I wonder if you can use sortBy instead of map and sortByKey, and instead
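For reference, a sketch of the two suggestions; `wordRDD` comes from the thread, everything else is illustrative:

    // countByValue replaces wordRDD.map(w => (w, 1)).reduceByKey(_ + _) and returns a Map on the driver
    val counts: scala.collection.Map[String, Long] = wordRDD.countByValue()

    // sortBy sorts by a key function directly, instead of mapping to (length, word) and calling sortByKey
    val sortedByLength = wordRDD.sortBy(_.length)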

unknown issue in submitting a spark job

2015-01-29 Thread ey-chih chow
Hi, I submitted a job using spark-submit and got the following exception. Anybody knows how to fix this? Thanks. Ey-Chih Chow 15/01/29 08:53:10 INFO storage.BlockManagerMasterActor: Registering block manager

Re: unknown issue in submitting a spark job

2015-01-29 Thread Arush Kharbanda
Hi, there are two ways to resolve the issue: 1. Increasing the heap size, via -Xmx1024m (or more), or 2. Disabling the error check altogether, via -XX:-UseGCOverheadLimit. As per http://stackoverflow.com/questions/5839359/java-lang-outofmemoryerror-gc-overhead-limit-exceeded you can pass the java
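For a Spark job those JVM settings are usually passed through the Spark configuration rather than a bare -Xmx; a hedged sketch (the values are illustrative, and disabling the GC-overhead check normally just delays the OutOfMemoryError):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "2g")                                 // executor heap size
      .set("spark.executor.extraJavaOptions", "-XX:-UseGCOverheadLimit")  // disable the overhead check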

Re: Dependency unresolved hadoop-yarn-common 1.0.4 when running quickstart example

2015-01-29 Thread Sarwar Bhuiyan
Thanks Arush. I did look into the dependency tree but couldn't figure out which dependency was bringing the wrong hadoop-yarn-common in. I'll try the quick fix first. Sarwar On Thu, 29 Jan 2015 at 09:33 Arush Kharbanda ar...@sigmoidanalytics.com wrote: Hi Sarwar, For a quick fix you can exclude

Re: is there a master for spark cluster in ec2

2015-01-29 Thread Arush Kharbanda
Hi Mohit, You can set the master instance type with -m. To set up a cluster you need to use the ec2/spark-ec2 script. You need to create an AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your AWS web console under Security Credentials, and pass them on to the script above. Once you do that you should

Re: Spark (Streaming?) holding on to Mesos resources

2015-01-29 Thread Gerard Maas
Thanks a lot. After reading Mesos-1688, I still don't understand how/why a job will hoard and hold on to so many resources even in the presence of that bug. Looking at the release notes, I think this ticket could be relevant to preventing the behavior we're seeing: [MESOS-186] - Resource offers

Failed to locate the winutils binary in the hadoop binary path

2015-01-29 Thread Naveen Kumar Pokala
Hi, I am facing the following issue when I am connecting from spark-shell. Please tell me how to avoid it. 15/01/29 17:21:27 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop

Streaming: Windowed calculations, Multiple apps

2015-01-29 Thread Jared Maulch
Hello, I would appreciate insights on the following questions: 1) Using Spark Streaming, I would like to keep windowed statistics for the past 30, 60 and 120 minutes. Is there an integrated/better way of doing this than creating three separate windows and pointing them to the same DStream? 2)

Re: Appending to an hdfs file

2015-01-29 Thread Matan Safriel
Thanks. I actually looked up foreachPartition() in this context yesterday, and couldn't find where it's documented in the Javadocs or elsewhere... probably for some silly reason. Can you please point me in the right direction? Many thanks! By the way, I realize the solution should rather be to

Re: Set is not parseable as row field in SparkSql

2015-01-29 Thread Jorge Lopez-Malla
Ok, Cheng. Thank you! Regards, Jorge López-Malla Matute Big Data Developer Vía de las Dos Castillas, 33. Ática 4. 3ª Planta 28224 Pozuelo de Alarcón, Madrid Tel: 91 828 64 73 // @stratiobd 2015-01-28 19:44 GMT+01:00 Cheng Lian lian.cs@gmail.com: Hey Jorge, This is expected.

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-29 Thread Jay Hutfles
Just curious, is this set to be merged at some point? On Thu Jan 22 2015 at 4:34:46 PM Ankur Dave ankurd...@gmail.com wrote: At 2015-01-22 02:06:37 -0800, NicolasC nicolas.ch...@inria.fr wrote: I try to execute a simple program that runs the ShortestPaths algorithm

Split RDD along columns

2015-01-29 Thread Schein, Sagi
Hi, I have the following use case: assuming that I have my data in e.g. hdfs, a single sequence file containing rows of CSV entries that I can split and build into an RDD of arrays of (smaller) strings. What I want to do is to build two RDDs where the first RDD contains a subset of columns and

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-29 Thread Rok Roskar
Thanks for the clarification on the partitioning. I did what you suggested and tried reading in individual part-* files -- some of them are ~1.7Gb in size and that's where it's failing. When I increase the number of partitions before writing to disk, it seems to work. Would be nice if this was

Re: connector for CouchDB

2015-01-29 Thread prateek arora
I am also looking for a connector for CouchDB in Spark. Did you find anything?

spark connector for CouchDB

2015-01-29 Thread prateek arora
I am looking for the Spark connector for CouchDB. Please help me.

Re: Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread gen tang
Hi, Using spark under windows is a really bad idea, because even if you solve the problems with hadoop, you will probably run into java.net.SocketException: connection reset by peer. It is caused by the fact that we request socket ports too frequently under windows. To my knowledge, it is really

Re: Failed to locate the winutils binary in the hadoop binary path

2015-01-29 Thread Akhil Das
You need to set your HADOOP_HOME in the environment. Here, in "Could not locate executable null\bin\winutils.exe in the Hadoop binaries", null is supposed to be your HADOOP_HOME. On 29 Jan 2015 15:57, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: Hi, I am facing the following issue when I
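If editing environment variables is awkward (for example inside an IDE), the same effect can be had by setting the hadoop.home.dir system property before the SparkContext is created; the path below is purely illustrative and must contain bin\winutils.exe:

    // Hypothetical path; must point at a directory that contains bin\winutils.exe
    System.setProperty("hadoop.home.dir", "C:\\hadoop")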

RE: Spark on Windows 2008 R2 server does not work

2015-01-29 Thread Wang, Ningjun (LNG-NPV)
I solved this problem following this article http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7 1) download compiled winutils.exe from

Re: Spark and S3 server side encryption

2015-01-29 Thread Danny
On Spark 1.2.0 you have the s3a library to work with S3. And there is a config param named fs.s3a.server-side-encryption-algorithm: https://github.com/Aloisius/hadoop-s3a

Re: Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread gen tang
Hi, I tried to use spark under windows once. However the only solution that I found was to install virtualbox. Hope this can help you. Best Gen On Thu, Jan 29, 2015 at 4:18 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I deployed spark-1.1.0 on Windows 7 and was able to

Re: SQL query over (Long, JSON string) tuples

2015-01-29 Thread Michael Armbrust
Eventually it would be nice for us to have some sort of function to do the conversion you are talking about on a single column, but for now I usually hack it as you suggested: val withId = origRDD.map { case (id, str) => s"""{"id": $id, ${str.trim.drop(1)}""" } val table = sqlContext.jsonRDD(withId) On

Re: Building Spark behind a proxy

2015-01-29 Thread Arush Kharbanda
Does the error change when you build with and without the build options? Did you try using maven and doing the proxy settings there? On Thu, Jan 29, 2015 at 9:17 PM, Soumya Simanta soumya.sima...@gmail.com wrote: I'm trying to build Spark (v1.1.1 and v1.2.0) behind a proxy using ./sbt/sbt assembly

Re: Hive on Spark vs. SparkSQL using Hive ?

2015-01-29 Thread Michael Armbrust
I would characterize the difference as follows: Spark SQL http://spark.apache.org/docs/latest/sql-programming-guide.html is the native engine for processing structured data using Spark. In contrast to Shark or Hive on Spark, it has its own optimizer that was designed for the RDD model. It is

Building Spark behind a proxy

2015-01-29 Thread Soumya Simanta
I'm trying to build Spark (v1.1.1 and v1.2.0) behind a proxy using ./sbt/sbt assembly and I get the following error. I've set the http and https proxy as well as JAVA_OPTS. Any idea what I am missing? [warn] one warning found org.apache.maven.model.building.ModelBuildingException: 1 problem

Re: RDD.combineBy without intermediate (k,v) pair allocation

2015-01-29 Thread Mohit Jaggi
Francois, RDD.aggregate() does not support aggregation by key. But, indeed, that is the kind of implementation I am looking for, one that does not allocate intermediate space for storing (K,V) pairs. When working with large datasets this type of intermediate memory allocation wreaks havoc with
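A hedged sketch of the kind of aggregation being discussed, using plain RDD.aggregate with a mutable accumulator so that no per-record (K, V) tuples are allocated; `words` and the count-by-word-length aggregation are illustrative, not the actual workload from this thread:

    import scala.collection.mutable

    // words: RDD[String] -- hypothetical input
    val countsByLength = words.aggregate(mutable.Map.empty[Int, Long])(
      (acc, w) => { acc(w.length) = acc.getOrElse(w.length, 0L) + 1L; acc },          // fold one record in
      (a, b)   => { b.foreach { case (k, v) => a(k) = a.getOrElse(k, 0L) + v }; a }   // merge partition results
    )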

Re: schemaRDD.saveAsParquetFile creates large number of small parquet files ...

2015-01-29 Thread Michael Armbrust
You can use coalesce or repartition to control the number of files output by any Spark operation. On Thu, Jan 29, 2015 at 9:27 AM, Manoj Samel manojsamelt...@gmail.com wrote: Spark 1.2 on Hadoop 2.3 Read one big csv file, create a schemaRDD on it and saveAsParquetFile. It creates a large
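A sketch of both options; the partition count and output paths are illustrative:

    // coalesce reduces the number of partitions (and output files) without a shuffle;
    // repartition shuffles, but can also increase the partition count
    schemaRDD.coalesce(16).saveAsParquetFile("/out/parquet-coalesced")
    schemaRDD.repartition(16).saveAsParquetFile("/out/parquet-repartitioned")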

Re: RDD.combineBy without intermediate (k,v) pair allocation

2015-01-29 Thread francois . garillot
Oh, I’m sorry, I meant `aggregateByKey`. https://spark.apache.org/docs/1.2.0/api/scala/#org.apache.spark.rdd.PairRDDFunctions — FG On Thu, Jan 29, 2015 at 7:58 PM, Mohit Jaggi mohitja...@gmail.com wrote: Francois, RDD.aggregate() does not support aggregation by key. But, indeed, that is

Re: RDD.combineBy without intermediate (k,v) pair allocation

2015-01-29 Thread francois . garillot
Sorry, I answered too fast. Please disregard my last message: I did mean aggregate. You say: RDD.aggregate() does not support aggregation by key. What would you need aggregation by key for, if you do not, at the beginning, have an RDD of key-value pairs, and do not want to build one?

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-29 Thread Ankur Dave
Thanks for the reminder. I just created a PR: https://github.com/apache/spark/pull/4273 Ankur On Thu, Jan 29, 2015 at 7:25 AM, Jay Hutfles jayhutf...@gmail.com wrote: Just curious, is this set to be merged at some point?

spark challenge: zip with next???

2015-01-29 Thread derrickburns
Here is a spark challenge for you! I have a data set where each entry has a date. I would like to identify gaps in the dates larger than a given length. For example, if the data were log entries, then the gaps would tell me when I was missing log data for long periods of time. What is the

RE: unknown issue in submitting a spark job

2015-01-29 Thread Mohammed Guller
Looks like the application is using a lot more memory than available. Could be a bug somewhere in the code or just underpowered machine. Hard to say without looking at the code. Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded Mohammed -Original Message- From:

Connecting Cassandra by unknow host

2015-01-29 Thread oxpeople
I have the code to set up the Cassandra connection: SparkConf conf = new SparkConf(true); conf.setAppName("Java cassandra RD"); conf.set("spark.cassandra.connection.host", "10.34.224.249"); but the log shows it trying to connect to a different host. 15/01/29 16:16:42 INFO NettyBlockTransferService: Server created on

RE: spark challenge: zip with next???

2015-01-29 Thread Mohammed Guller
Another solution would be to use the reduce action. Mohammed From: Ganelin, Ilya [mailto:ilya.gane...@capitalone.com] Sent: Thursday, January 29, 2015 1:32 PM To: 'derrickburns'; 'user@spark.apache.org' Subject: RE: spark challenge: zip with next??? Make a copy of your RDD with an extra entry

RE: spark challenge: zip with next???

2015-01-29 Thread Ganelin, Ilya
Make a copy of your RDD with an extra entry in the beginning to offset. Then you can zip the two RDDs and run a map to generate an RDD of differences. Sent with Good (www.good.com) -Original Message- From: derrickburns [derrickrbu...@gmail.com] Sent:
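Plain RDD.zip requires both RDDs to have identical partitioning and element counts, so the offset idea is usually realized with zipWithIndex plus a join instead; a hedged sketch, with `dates` as a hypothetical sorted RDD[Long] of timestamps:

    import org.apache.spark.SparkContext._

    val indexed     = dates.zipWithIndex().map { case (d, i) => (i, d) }   // (position, date)
    val shiftedBack = indexed.map { case (i, d) => (i - 1, d) }            // element i keyed as i-1
    val consecutive = indexed.join(shiftedBack)                            // (i, (date_i, date_i+1))
    val gaps        = consecutive.mapValues { case (cur, next) => next - cur }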

Driver startup error when submitting in cluster mode

2015-01-29 Thread nate.y
Hello everyone. I am having what I am sure is a configuration error. I am trying to use my spark cluster in cluster mode without success. So far search results have not yielded any clues. If I use the same submit command but with client mode specified, everything works fine. I have tried

Re: connector for CouchDB

2015-01-29 Thread prateek arora
Yes please, but I am new to Spark and CouchDB.

RE: unknown issue in submitting a spark job

2015-01-29 Thread Mohammed Guller
How much memory are you assigning to the Spark executor on the worker node? Mohammed From: ey-chih chow [mailto:eyc...@hotmail.com] Sent: Thursday, January 29, 2015 3:35 PM To: Mohammed Guller; user@spark.apache.org Subject: RE: unknown issue in submitting a spark job The worker node has 15G

Error when running spark in debug mode

2015-01-29 Thread Ankur Srivastava
Hi, Whenever I enable DEBUG level logs for my spark cluster, on running a job all the executors die with the below exception. On disabling the DEBUG logs my jobs move to the next step. I am on spark-1.1.0. Is this a known issue with spark? Thanks Ankur 2015-01-29 22:27:42,467 [main] INFO

Re: connector for CouchDB

2015-01-29 Thread prateek arora
I can also switch to MongoDB if Spark has support for it.

Re: Connecting Cassandra by unknow host

2015-01-29 Thread Ankur Srivastava
Hi, I am no expert but have a small application working with Spark and Cassandra. I faced these issues when we were deploying our cluster on EC2 instances with some machines on public network and some on private. This seems to be a similar issue as you are trying to connect to 10.34.224.249

spark with cdh 5.2.1

2015-01-29 Thread Mohit Jaggi
Hi All, I noticed in pom.xml that there is no entry for Hadoop 2.5. Has anyone tried Spark with 2.5.0-cdh5.2.1? Will replicating the 2.4 entry be sufficient to make this work? Mohit.

Re: spark with cdh 5.2.1

2015-01-29 Thread Nobuhiro Sue
Mohit, I'm using the spark modules provided by the Cloudera repos; they work fine. Please add the Cloudera maven repo, and specify dependencies with the CDH version, like spark-core_2.10-1.1.0-cdh5.2.1. To add the Cloudera maven repo, see:
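In sbt terms that would look roughly like the following; the repository URL is the commonly used Cloudera one and the version string is illustrative, so verify both against Cloudera's documentation:

    // build.sbt -- hypothetical CDH-flavoured Spark dependency
    resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0-cdh5.2.1"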

RE: unknown issue in submitting a spark job

2015-01-29 Thread ey-chih chow
The worker node has 15G memory, 1x32 GB SSD, and 2 cores. The data file is from S3. If I don't set mapred.max.split.size, it is fine with only one partition. Otherwise, it will generate an OOME. Ey-Chih Chow From: moham...@glassbeam.com To: eyc...@hotmail.com; user@spark.apache.org Subject: RE:

Re: We are migrating Tera Data SQL to Spark SQL. Query is taking long time. Please have a look on this issue

2015-01-29 Thread hnahak
Do set executor memory as well. You have RAM in each node and storage; set it to 6 GB or more, and if required change driver memory from 10 GB to more. --Harihar

What could cause number of tasks to go down from 2k to 1?

2015-01-29 Thread freedafeng
Hi, The input data has 2048 partitions. The final step is to load the processed data into hbase through saveAsNewAPIHadoopDataset(). Every step except the last one ran in parallel in the cluster. But the last step only has 1 task which runs on only 1 node using one core. Spark 1.1.1. +

Re: spark challenge: zip with next???

2015-01-29 Thread Tobias Pfeiffer
Hi, On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Make a copy of your RDD with an extra entry in the beginning to offset. Then you can zip the two RDDs and run a map to generate an RDD of differences. Does that work? I recently tried something to compute

KMeans with large clusters Java Heap Space

2015-01-29 Thread mvsundaresan
Trying to cluster small text msgs, using HashingTF and IDF with L2 Normalization. Data looks like this: id, msg 1, some text1 2, some more text2 3, sample text 3 Input data file size is 1.7 MB with 10 K rows. It runs (very slow, took 3 hrs) for up to 20 clusters, but when I ask for 200 clusters

Re: spark challenge: zip with next???

2015-01-29 Thread Mohit Jaggi
http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E you can use the MLLib
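The MLlib helper presumably being referred to is the sliding transform in org.apache.spark.mllib.rdd.RDDFunctions (a developer API in the 1.x line); a sketch for the zip-with-next use case, assuming `dates` is a sorted RDD[Long]:

    import org.apache.spark.mllib.rdd.RDDFunctions._

    // sliding(2) yields every element paired with its successor
    val gaps = dates.sliding(2).map { case Array(cur, next) => next - cur }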

Error when get data from hive table. Use python code.

2015-01-29 Thread QiuxuanZhu
Dear all, I have no idea why it raises an error when I run the following code. def getRow(data): return data.msg first_sql = "select * from logs.event where dt = '20150120' and et = 'ppc' LIMIT 10" #error #first_sql = "select * from hivecrawler.vip_crawler where src='xx' and dt=' +

Re: Error when get data from hive table. Use python code.

2015-01-29 Thread Cheng Lian
What version of Spark and Hive are you using? Spark 1.1.0 and prior versions only support Hive 0.12.0. Spark 1.2.0 supports Hive 0.12.0 or 0.13.1. Cheng On 1/29/15 6:36 PM, QiuxuanZhu wrote: Dear all, I have no idea why it raises an error when I run the following code. def

HiveContext created SchemaRDD's saveAsTable is not working on 1.2.0

2015-01-29 Thread matroyd
Hi, I am trying saveAsTable on a SchemaRDD created from HiveContext and it fails. This is on Spark 1.2.0. Following are the details of the code, command and exceptions: http://stackoverflow.com/questions/28222496/how-to-enable-sql-on-schemardd-via-the-jdbc-interface-is-it-even-possible

Re: Error when get data from hive table. Use python code.

2015-01-29 Thread Zhan Zhang
You are running yarn-client mode. How about increasing the --driver-memory and giving it a try? Thanks. Zhan Zhang On Jan 29, 2015, at 6:36 PM, QiuxuanZhu ilsh1...@gmail.com wrote: Dear all, I have no idea why it raises an error when I run the following code. def

Re: Error when get data from hive table. Use python code.

2015-01-29 Thread Davies Liu
On Thu, Jan 29, 2015 at 6:36 PM, QiuxuanZhu ilsh1...@gmail.com wrote: Dear all, I have no idea why it raises an error when I run the following code. def getRow(data): return data.msg first_sql = "select * from logs.event where dt = '20150120' and et = 'ppc' LIMIT 10" #error

Read from file and broadcast before every Spark Streaming bucket?

2015-01-29 Thread YaoPau
I'm creating a real-time visualization of counts of ads shown on my website, using that data pushed through by Spark Streaming. To avoid clutter, it only looks good to show 4 or 5 lines on my visualization at once (corresponding to 4 or 5 different ads), but there are 50+ different ads that show

RE: schemaRDD.saveAsParquetFile creates large number of small parquet files ...

2015-01-29 Thread Felix C
Try rdd.coalesce(1).saveAsParquetFile(...) http://spark.apache.org/docs/1.2.0/programming-guide.html#transformations --- Original Message --- From: Manoj Samel manojsamelt...@gmail.com Sent: January 29, 2015 9:28 AM To: user@spark.apache.org Subject: schemaRDD.saveAsParquetFile creates large

Re: HW imbalance

2015-01-29 Thread Sandy Ryza
My answer was based off the specs that Antony mentioned: different amounts of memory, but 10 cores on all the boxes. In that case, a single Spark application's homogeneously sized executors won't be able to take advantage of the extra memory on the bigger boxes. Cloudera Manager can certainly

Re: connector for CouchDB

2015-01-29 Thread Harihar Nahak
No, I changed it to MongoDB. But you can write your own custom code to connect to CouchDB directly; there is no such connector available on the market. By extending a few classes you can read CouchDB. I can help you with that; let me know if you are really interested. On 30 January 2015 at 06:46,

RE: unknown issue in submitting a spark job

2015-01-29 Thread ey-chih chow
I use the default value, which I think is 512MB. If I change it to 1024MB, spark-submit will fail due to not enough memory for the RDD. Ey-Chih Chow From: moham...@glassbeam.com To: eyc...@hotmail.com; user@spark.apache.org Subject: RE: unknown issue in submitting a spark job Date: Fri, 30 Jan 2015

Re: HiveContext created SchemaRDD's saveAsTable is not working on 1.2.0

2015-01-29 Thread Zhan Zhang
I think it is expected. Refer to the comments in saveAsTable: "Note that this currently only works with SchemaRDDs that are created from a HiveContext". If I understand correctly, here the SchemaRDD means those generated by HiveContext.sql, instead of applySchema. Thanks. Zhan Zhang On Jan
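A hedged illustration of that distinction on Spark 1.2; the table names are hypothetical:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // A SchemaRDD produced by the HiveContext itself (e.g. via sql) ...
    val fromHive = hiveContext.sql("SELECT * FROM some_existing_table")
    // ... can be persisted as a Hive table, whereas one built with applySchema may hit the limitation above
    fromHive.saveAsTable("copied_table")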

Fwd: HiveContext created SchemaRDD's saveAsTable is not working on 1.2.0

2015-01-29 Thread Ayoub
Hello, I had the same issue, then I found this JIRA ticket https://issues.apache.org/jira/browse/SPARK-4825 So I switched to Spark 1.2.1-snapshot, which solved the problem. 2015-01-30 8:40 GMT+01:00 Zhan Zhang zzh...@hortonworks.com: I think it is expected. Refer to the comments in

Re: HW imbalance

2015-01-29 Thread Michael Segel
@Sandy, There are two issues. The spark context (executor) and then the cluster under YARN. If you have a box where each yarn job needs 3GB, and your machine has 36GB dedicated as a YARN resource, you can run 12 executors on the single node. If you have a box that has 72GB dedicated to