Re: settings from props file seem to be ignored in mesos

2015-06-16 Thread Gary Ogden
There isn't a conf/spark-defaults.conf file in the .tgz. There's a template file, but we didn't think we'd need one. I assumed we'd use the defaults, and anything we wanted to override would go in the properties file we load via --properties-file, as well as command-line params (--master etc.).
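
A minimal sketch of the setup being described, using the property names quoted later in this thread; the master URL, class and jar are placeholders, and the file passed to --properties-file uses the same key/value format as conf/spark-defaults.conf:

```
# my-app.properties (passed with --properties-file)
spark.executor.memory          256M
spark.cores.max                1
spark.shuffle.consolidateFiles true

# launch (placeholder master URL, class and jar):
# spark-submit --master mesos://mesos-master:5050 \
#   --properties-file my-app.properties \
#   --class com.example.MyApp my-app.jar
```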

Re: Spark History Server pointing to S3

2015-06-16 Thread Akhil Das
Not quite sure, but try pointing spark.history.fs.logDirectory to your S3. Thanks Best Regards On Tue, Jun 16, 2015 at 6:26 PM, Gianluca Privitera gianluca.privite...@studio.unibo.it wrote: In Spark website it’s stated in the View After the Fact section (
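
A sketch of the configuration being suggested, with a placeholder bucket; whether FsHistoryProvider can actually read an s3n:// path is exactly what this thread goes on to question, and the S3 credentials would have to be visible to the Hadoop filesystem layer (e.g. in core-site.xml):

```
# conf/spark-defaults.conf on the host running start-history-server.sh
spark.history.fs.logDirectory   s3n://my-bucket/spark-event-logs

# and on the applications writing the event logs:
spark.eventLog.enabled          true
spark.eventLog.dir              s3n://my-bucket/spark-event-logs
```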

Re: tasks won't run on mesos when using fine grained

2015-06-16 Thread Gary Ogden
On the master node, I see this printed over and over in the mesos-master.WARNING log file: W0615 06:06:51.211262 8672 hierarchical_allocator_process.hpp:589] Using the default value of 'refuse_seconds' to create the refused resources filter because the input value is negative Here's what I see

Spark History Server pointing to S3

2015-06-16 Thread Gianluca Privitera
On the Spark website it’s stated in the View After the Fact section (https://spark.apache.org/docs/latest/monitoring.html) that you can point the start-history-server.sh script to a directory in order to view the Web UI using the logs as data source. Is it possible to point that script to S3?

Re: Spark standalone mode and kerberized cluster

2015-06-16 Thread Borja Garrido Bear
Thank you for the answer, it doesn't seem to work either (I've not logged into the machine as the spark user, but ran kinit inside the spark-env script), and I also tried inside the job. I've noticed when I run pyspark that the kerberos token is used for something, but this same behavior is not present

RE: Optimizing Streaming from Websphere MQ

2015-06-16 Thread Chaudhary, Umesh
Thanks Akhil for taking this point, I am also talking about the MQ bottleneck. I currently have 5 receivers for an unreliable Websphere MQ receiver implementation. Is there any proven way to convert this implementation to a reliable one? Regards, Umesh Chaudhary From: Akhil Das

stop streaming context of job failure

2015-06-16 Thread Krot Viacheslav
Hi all, Is there a way to stop the streaming context when some batch processing fails? I want to set a reasonable retries count, say 10, and if it still fails - stop the context completely. Is that possible?

how to maintain the offset for spark streaming if HDFS is the source

2015-06-16 Thread Manohar753
Hi All, In my use case an HDFS file is the source for Spark Streaming; the job will process the data line by line, but how do I make sure the offset (line number of data already processed) is maintained across a restart or new code push? Team, can you please reply on this - is there any configuration in Spark?
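
There is no built-in per-line offset tracking for a file source, but metadata checkpointing lets a restarted driver resume where the previous run stopped. A minimal sketch, with placeholder paths and batch interval:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/spark/checkpoints/my-app"   // placeholder

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("hdfs-source-app")
  val ssc = new StreamingContext(conf, Seconds(60))
  ssc.checkpoint(checkpointDir)
  // textFileStream only picks up files that appear in the directory after start
  val lines = ssc.textFileStream("hdfs:///data/incoming")
  lines.foreachRDD(rdd => println(s"processed ${rdd.count()} lines"))
  ssc
}

// Recover from the checkpoint if one exists, otherwise build a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```

Note that checkpoints are generally not reusable across a code push, so a new deployment usually needs its own bookkeeping (for example, moving already-processed files out of the input directory).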

RE: stop streaming context of job failure

2015-06-16 Thread Evo Eftimov
https://spark.apache.org/docs/latest/monitoring.html also subscribe to the various Listeners for the various Metrics types, e.g. Job Stats/Statuses - this will allow you (in the driver) to decide when to stop the context gracefully (the listening and stopping can be done from a completely
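
A sketch of the listener-based approach, assuming a hypothetical failure threshold; JobSucceeded is the public job result, and the stop runs on a separate thread so the listener bus isn't blocked:

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{JobSucceeded, SparkListener, SparkListenerJobEnd}
import org.apache.spark.streaming.StreamingContext

class StopOnFailuresListener(ssc: StreamingContext, maxFailures: Int) extends SparkListener {
  private val failures = new AtomicInteger(0)

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    if (jobEnd.jobResult != JobSucceeded) {
      if (failures.incrementAndGet() >= maxFailures) {
        new Thread("streaming-stopper") {
          override def run(): Unit =
            ssc.stop(stopSparkContext = true, stopGracefully = false)
        }.start()
      }
    } else {
      failures.set(0)   // only count consecutive failures
    }
  }
}

// ssc.sparkContext.addSparkListener(new StopOnFailuresListener(ssc, maxFailures = 10))
```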

Re: Spark SQL and Skewed Joins

2015-06-16 Thread Koert Kuipers
a skew join (where the dominant key is spread across multiple executors) is pretty standard in other frameworks, see for example in scalding: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/JoinAlgorithms.scala this would be a great addition to
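
Spark has no built-in skew join in these versions, but the usual hand-rolled version salts the hot keys and replicates the other side; a small sketch, assuming an existing SparkContext `sc` and toy pair RDDs standing in for the real tables:

```scala
import scala.util.Random

val numSalts = 10
val skewed = sc.parallelize(Seq(("hot", 1), ("hot", 2), ("cold", 3)))   // skewed fact side
val other  = sc.parallelize(Seq(("hot", "a"), ("cold", "b")))           // smaller side

// Salt the skewed side so one key spreads over numSalts sub-keys,
// and replicate the other side so every salted key still finds its match.
val saltedLeft = skewed.map { case (k, v) => ((k, Random.nextInt(numSalts)), v) }
val replicatedRight = other.flatMap { case (k, v) =>
  (0 until numSalts).map(salt => ((k, salt), v))
}
val joined = saltedLeft.join(replicatedRight)
  .map { case ((k, _), pair) => (k, pair) }   // strip the salt again
```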

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-16 Thread Sanjay Subramanian
Hi Josh It was great meeting you in person at the spark-summit SFO yesterday. Thanks for discussing potential solutions to the problem. I verified that 2 hive gateway nodes had not been configured correctly. My bad. I added hive-site.xml to the spark conf directories for these 2 additional hive

Re: Spark SQL and Skewed Joins

2015-06-16 Thread Jon Walton
On Fri, Jun 12, 2015 at 9:43 PM, Michael Armbrust mich...@databricks.com wrote: 2. Does 1.3.2 or 1.4 have any enhancements that can help? I tried to use 1.3.1 but SPARK-6967 prohibits me from doing so. Now that 1.4 is available, would any of the JOIN enhancements help this situation? I

Re: Spark History Server pointing to S3

2015-06-16 Thread Gianluca Privitera
It gives me an exception from org.apache.spark.deploy.history.FsHistoryProvider, a problem with the file system. I can reproduce the exception if you want. It works perfectly if I give a local path; I tested it on version 1.3.0. Gianluca On 16 Jun 2015, at 15:08, Akhil Das

Re: ALS predictALL not completing

2015-06-16 Thread Ayman Farahat
This is 1.3.1 Ayman Farahat  --  View my research on my SSRN Author page:  http://ssrn.com/author=1594571  From: Nick Pentreath nick.pentre...@gmail.com To: user@spark.apache.org user@spark.apache.org Sent: Tuesday, June 16, 2015 4:23 AM

Re: How does one decide no of executors/cores/memory allocation?

2015-06-16 Thread shreesh
I realize that there are a lot of ways to configure my application in Spark. The part that is not clear is how I decide, for example, into how many partitions I should divide my data, how much RAM I should have, or how many workers one should initialize. -- View this message in

The problem when share data inside Dstream

2015-06-16 Thread Shuai Zhang
Hello guys, I faced a problem where I cannot pass my data inside an RDD partition when trying to develop a Spark Streaming feature. I'm a newcomer to Spark; could you please give me any suggestions on this problem? The figure in the attachment is the code I used in my program: After I run my

Unit Testing Spark Transformations/Actions

2015-06-16 Thread Mark Tse
Hi there, I am looking to use Mockito to mock out some functionality while unit testing a Spark application. I currently have code that happily runs on a cluster, but fails when I try to run unit tests against it, throwing a SparkException: org.apache.spark.SparkException: Job aborted due to
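
A common pattern (sketched here assuming ScalaTest) is to run the code under test against a local-mode SparkContext and reserve Mockito for collaborators that never enter a closure, since functions shipped to executors must be serializable and mocks usually aren't:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class WordCountSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()
  }

  test("counts words") {
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("a") === 2)
  }
}
```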

RE: How does one decide no of executors/cores/memory allocation?

2015-06-16 Thread Evo Eftimov
Best is by measuring and recording how the performance of your solution scales as the workload scales - recording as in data points - and then you can do some time-series statistical analysis and visualizations. For example you can start with a single box with e.g. 8 CPU cores. Use e.g. 1 or

Re: Creating RDD from Iterable from groupByKey results

2015-06-16 Thread nir
I updated code sample so people can understand better what are my inputs and outputs. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Creating-RDD-from-Iterable-from-groupByKey-results-tp23328p23341.html Sent from the Apache Spark User List mailing list

Re: SparkR 1.4.0: read.df() function fails

2015-06-16 Thread Guru Medasani
Hi Esten, Looks like your sqlContext is connected to a Hadoop/Spark cluster, but the file path you specified is local: mydf <- read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", ... The error below shows that the input path you specified does not exist on the cluster. Pointing to the right

Re: SparkR 1.4.0: read.df() function fails

2015-06-16 Thread Shivaram Venkataraman
The error you are running into is that the input file does not exist -- You can see it from the following line: Input path does not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json Thanks Shivaram On Tue, Jun 16, 2015 at 1:55 AM, esten erik.stens...@dnvgl.com wrote: Hi, In SparkR

HDFS not supported by databricks cloud :-(

2015-06-16 Thread Sanjay Subramanian
hey guys After day one at the spark-summit SFO, I realized sadly that (indeed) HDFS is not supported by Databricks cloud. My speed bottleneck is to transfer ~1TB of snapshot HDFS data (250+ external hive tables) to S3 :-(  I want to use databricks cloud but this to me is a starting disabler. The

spark-sql CLI options does not work --master yarn --deploy-mode client

2015-06-16 Thread Sanjay Subramanian
hey guys  I have CDH 5.3.3 with Spark 1.2.0 (on Yarn) This does not work /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql  --deploy-mode client --master yarn --driver-memory 1g -e select j.person_id, p.first_name, p.last_name, count(*) from (select person_id from cdr.cdr_mjp_joborder where

Re: FW: MLLIB (Spark) Question.

2015-06-16 Thread DB Tsai
+cc user@spark.apache.org Reply inline. On Tue, Jun 16, 2015 at 2:31 PM, Dhar Sauptik (CR/RTC1.3-NA) Sauptik.Dhar wrote: Hi DB, Thank you for the reply. That explains a lot. I however had a few points regarding this:- 1. Just to help with the debate of not regularizing the b parameter. A

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-16 Thread Rex X
Hi Sujit, That's a good point. But 1-hot encoding will make our data grow from terabytes to petabytes, because we have tens of categorical attributes, and some of them contain thousands of categorical values. Is there any way to make a good balance of data size and the right representation of

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-16 Thread Sujit Pal
Hi Rexx, In general (i.e. not Spark specific), it's best to convert categorical data to 1-hot encoding rather than integers - that way the algorithm doesn't use the ordering implicit in the integer representation. -sujit On Tue, Jun 16, 2015 at 1:17 PM, Rex X dnsr...@gmail.com wrote: Is it
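
The size concern raised above is usually handled with sparse vectors: each row stores only the indices that are active, so storage grows with the number of attributes per row rather than the total number of category values. A small sketch with a hypothetical value-to-index lookup:

```scala
import org.apache.spark.mllib.linalg.Vectors

val numCategories = 5000                                        // total distinct values
val categoryIndex = Map("red" -> 0, "green" -> 1, "blue" -> 2)  // built from the data

// One active index per categorical value; the vector length is numCategories
// but only a single (index, value) pair is actually stored.
def oneHot(value: String) =
  Vectors.sparse(numCategories, Array(categoryIndex(value)), Array(1.0))

val v = oneHot("green")   // SparseVector(5000, [1], [1.0])
```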

Re: HDFS not supported by databricks cloud :-(

2015-06-16 Thread Simon Elliston Ball
You could consider using Zeppelin and spark on yarn as an alternative. http://zeppelin.incubator.apache.org/ Simon On 16 Jun 2015, at 17:58, Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID wrote: hey guys After day one at the spark-summit SFO, I realized sadly that (indeed) HDFS

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-16 Thread Rex X
Is it necessary to convert categorical data into integers? Any tips would be greatly appreciated! -Rex On Sun, Jun 14, 2015 at 10:05 AM, Rex X dnsr...@gmail.com wrote: For clustering analysis, we need a way to measure distances. When the data contains different levels of measurement -

Pyspark Dense Matrix Multiply : One of them can fit in Memory

2015-06-16 Thread afarahat
Hello I would like to multiply two matrices C = A * B, where A is m x k and B is k x l, with k, l << m so that B can easily fit in memory. Any ideas or suggestions how to do that in Pyspark? Thanks Ayman -- View this message in context:
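
One common answer is to broadcast B and compute each row of C with a map over A; sketched below in Scala (the same broadcast pattern works in PySpark with sc.broadcast and numpy), assuming an existing SparkContext `sc` and toy matrices:

```scala
// A distributed by rows (m x k), B small enough to live on every executor (k x l).
val a = sc.parallelize(Seq(
  (0L, Array(1.0, 2.0, 3.0)),
  (1L, Array(4.0, 5.0, 6.0))))
val b = Array(Array(1.0, 0.0), Array(0.0, 1.0), Array(1.0, 1.0))
val bBroadcast = sc.broadcast(b)

// Each row of C = A * B is just l dot products against the broadcast matrix.
val c = a.mapValues { row =>
  val bLocal = bBroadcast.value
  Array.tabulate(bLocal(0).length) { j =>
    row.indices.map(i => row(i) * bLocal(i)(j)).sum
  }
}
c.collect()
```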

Re: Spark on EMR

2015-06-16 Thread ayan guha
That's great news. Can I assume spark on EMR supports kinesis to hbase pipeline? On 17 Jun 2015 05:29, kamatsuoka ken...@gmail.com wrote: Spark is now officially supported on Amazon Elastic Map Reduce: http://aws.amazon.com/elasticmapreduce/details/spark/ -- View this message in context:

What happens when a streaming consumer job is killed then restarted?

2015-06-16 Thread dgoldenberg
I'd like to understand better what happens when a streaming consumer job (with direct streaming, but also with receiver-based streaming) is killed/terminated/crashes. Assuming it was processing a batch of RDD data, what happens when the job is restarted? How much state is maintained within

Custom Spark metrics?

2015-06-16 Thread dgoldenberg
I'm looking at the doc here: https://spark.apache.org/docs/latest/monitoring.html. Is there a way to define custom metrics in Spark, via Coda Hale perhaps, and emit those? Can a custom metrics sink be defined? And, can such a sink collect some metrics, execute some metrics handling logic, then
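
Spark's own metrics Source/Sink traits are not public API in these versions, so one low-friction starting point is to use the Coda Hale (Dropwizard) registry directly in the driver and attach whatever reporter you like; a custom Reporter is also where metrics-handling logic could run. A sketch with placeholder metric names:

```scala
import java.util.concurrent.TimeUnit
import com.codahale.metrics.{ConsoleReporter, MetricRegistry}

val registry = new MetricRegistry()
val recordsProcessed = registry.counter("myapp.records.processed")

// Report every 30 seconds; swap ConsoleReporter for a custom Reporter (or a
// Graphite/JMX reporter) to push the values somewhere else.
val reporter = ConsoleReporter.forRegistry(registry)
  .convertRatesTo(TimeUnit.SECONDS)
  .convertDurationsTo(TimeUnit.MILLISECONDS)
  .build()
reporter.start(30, TimeUnit.SECONDS)

// driver-side bookkeeping, e.g. after collecting a batch result:
recordsProcessed.inc(100)
```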

Suggestions for Posting on the User Mailing List

2015-06-16 Thread nsalian
As discussed during the meetup, the following information should help while creating a topic on the User mailing list. 1) Version of Spark and Hadoop should be included to help reproduce the issue or understand if the issue is a version limitation 2) Explanation about the scenario in as much

What is Spark's data retention policy?

2015-06-16 Thread dgoldenberg
What is Spark's data retention policy? As in, the jobs that are sent from the master to the worker nodes, how long do they persist on those nodes? What about the RDD data, how is that cleaned up? Are all RDD's cleaned up at GC time unless they've been .persist()'ed or .cache()'ed? -- View

Re: Spark SQL and Skewed Joins

2015-06-16 Thread Michael Armbrust
this would be a great addition to spark, and ideally it belongs in spark core not sql. I agree with the fact that this would be a great addition, but we would likely want a specialized SQL implementation for performance reasons.

Re: SparkR 1.4.0: read.df() function fails

2015-06-16 Thread nsalian
Hello, Is the json file in HDFS or local? /home/esten/ami/usaf.json is this an HDFS path? Suggestions: 1) Specify file:/home/esten/ami/usaf.json 2) Or move the usaf.json file into HDFS since the application is looking for the file in HDFS. Please let me know if that helps. Thank you. --

Re: DataFrame insertIntoJDBC parallelism while writing data into a DB table

2015-06-16 Thread Mohammad Tariq
I would really appreciate if someone could help me with this. On Monday, June 15, 2015, Mohammad Tariq donta...@gmail.com wrote: Hello list, The method *insertIntoJDBC(url: String, table: String, overwrite: Boolean)* provided by Spark DataFrame allows us to copy a DataFrame into a JDBC DB

ClassNotFound exception from closure

2015-06-16 Thread Yana Kadiyska
Hi folks, running into a pretty strange issue -- I have a ClassNotFound exception from a closure?! My code looks like this: val jRdd1 = table.map(cassRow => { val lst = List(cassRow.get[Option[Any]](0), cassRow.get[Option[Any]](1)); Row.fromSeq(lst) }) println(s"This one worked

Re: DataFrame insertIntoJDBC parallelism while writing data into a DB table

2015-06-16 Thread Yana Kadiyska
When all else fails look at the source ;) Looks like createJDBCTable is deprecated, but otherwise goes to the same implementation as insertIntoJDBC... https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala You can also look at DataFrameWriter in
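
For reference, a sketch of the 1.4 DataFrameWriter path mentioned above, with placeholder URL, table and credentials, and `df` standing in for an existing DataFrame; each partition of the DataFrame writes over its own connection, so repartitioning controls the write parallelism:

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "secret")

df.repartition(8)                 // roughly 8 concurrent connections
  .write
  .mode(SaveMode.Append)          // or Overwrite, mirroring insertIntoJDBC's flag
  .jdbc("jdbc:mysql://dbhost:3306/mydb", "target_table", props)
```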

Spark or Storm

2015-06-16 Thread asoni . learn
Hi All, I am evaluating Spark vs Storm (Spark Streaming) and I am not able to see what the equivalent of Storm's Bolt is inside Spark. Any help will be appreciated on this. Thanks, Ashish

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Yanbo Liang
If you run Spark on YARN, the simplest way is to replace the $SPARK_HOME/lib/spark-.jar with your own version of the Spark jar file and run your application. The spark-submit script will upload this jar to the YARN cluster automatically and then you can run your application as usual. It does not care about

Re: Spark or Storm

2015-06-16 Thread ayan guha
I have a similar scenario where we need to bring data from kinesis to hbase. Data velocity is 20k per 10 mins. A little manipulation of the data will be required, but that's regardless of the tool, so we will be writing that piece as Java POJOs. All env is on aws. Hbase is on a long running EMR and

Incorrect ACL checking for partitioned table in Spark SQL-1.4

2015-06-16 Thread Karthik Subramanian
*Problem Statement:* While querying a partitioned table using Spark SQL (version 1.4.0), an access denied exception is observed on the partition the user doesn’t belong to (the user permission is controlled using HDFS ACLs). The same works correctly in Hive. *Use case:* To address

Re: Spark or Storm

2015-06-16 Thread Spark Enthusiast
I have a use-case where a stream of incoming events has to be aggregated and joined to create complex events. The aggregation will have to happen at an interval of 1 minute (or less). The pipeline is: send events ... enrich

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Raghav Shankar
To clarify, I am using the spark standalone cluster. On Tuesday, June 16, 2015, Yanbo Liang yblia...@gmail.com wrote: If you run Spark on YARN, the simplest way is replace the $SPARK_HOME/lib/spark-.jar with your own version spark jar file and run your application. The spark-submit

Re: FW: MLLIB (Spark) Question.

2015-06-16 Thread DB Tsai
Hi Dhar, For standardization, we can disable it effectively by using different regularization on each component. Thus, we're solving the same problem but having better rate of convergence. This is one of the features I will implement. Sincerely, DB Tsai

Re: Spark or Storm

2015-06-16 Thread Sateesh Kavuri
Probably overloading the question a bit. In Storm, Bolts have the functionality of getting triggered on events. Is that kind of functionality possible with Spark streaming? During each phase of the data processing, the transformed data is stored to the database and this transformed data should
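
The closest analogue is a DStream transformation per stage plus foreachRDD/foreachPartition for the side effect; the transformed data is then still a DStream that later stages can consume. A sketch, assuming an existing StreamingContext `ssc`, with a socket source and println standing in for the real input and database sink:

```scala
import org.apache.spark.streaming.Minutes

val events   = ssc.socketTextStream("eventhost", 9999)
val enriched = events.map(_.toUpperCase)                 // "bolt 1": transform/enrich

enriched.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // open one DB connection per partition here, then write each record
    records.foreach(r => println(s"would write: $r"))
  }
}

// "bolt 2": downstream aggregation over the already-transformed stream
val counts = enriched.map(r => (r, 1L)).reduceByKeyAndWindow(_ + _, Minutes(1))
counts.print()
```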

Re: number of partitions in join: Spark documentation misleading!

2015-06-16 Thread Davies Liu
Please file a JIRA for it. On Mon, Jun 15, 2015 at 8:00 AM, mrm ma...@skimlinks.com wrote: Hi all, I was looking for an explanation on the number of partitions for a joined rdd. The documentation of Spark 1.3.1. says that: For distributed shuffle operations like reduceByKey and join, the

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Will Briggs
In general, you should avoid making direct changes to the Spark source code. If you are using Scala, you can seamlessly blend your own methods on top of the base RDDs using implicit conversions. Regards, Will On June 16, 2015, at 7:53 PM, raggy raghav0110...@gmail.com wrote: I am trying to
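
A sketch of the implicit-conversion approach described above, using a hypothetical treeReduce-based top() so RDD.scala itself stays untouched:

```scala
import org.apache.spark.rdd.RDD

object RddExtensions {
  implicit class RichRDD[T: Ordering](rdd: RDD[T]) {
    // Keep the n largest elements per partition, then merge with treeReduce.
    def topByTree(n: Int): Seq[T] =
      rdd.mapPartitions(it => Iterator(it.toSeq.sorted.reverse.take(n)))
         .treeReduce((a, b) => (a ++ b).sorted.reverse.take(n))
  }
}

// usage:
// import RddExtensions._
// sc.parallelize(1 to 1000000).topByTree(5)
```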

Re: Spark or Storm

2015-06-16 Thread Will Briggs
The programming models for the two frameworks are conceptually rather different; I haven't worked with Storm for quite some time, but based on my old experience with it, I would equate Spark Streaming more with Storm's Trident API, rather than with the raw Bolt API. Even then, there are

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Raghav Shankar
I made the change so that I could implement top() using treeReduce(). A member on here suggested I make the change in RDD.scala to accomplish that. Also, this is for a research project, and not for commercial use. So, any advice on how I can get the spark submit to use my custom built jars

questions on the waiting batches and scheduling delay in Streaming UI

2015-06-16 Thread Fang, Mike
Hi, I have a spark streaming program running for ~25hrs. When I check the Streaming UI tab, I found the Waiting batches count is 144, but the scheduling delay is 0. I am a bit confused. If the waiting batches count is 144, that means many batches are waiting in the queue to be processed? If this is the

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Will Briggs
If this is research-only, and you don't want to have to worry about updating the jars installed by default on the cluster, you can add your custom Spark jar using the spark.driver.extraLibraryPath configuration property when running spark-submit, and then use the experimental

Re: Not getting event logs = spark 1.3.1

2015-06-16 Thread Tsai Li Ming
Forgot to mention this is on standalone mode. Is my configuration wrong? Thanks, Liming On 15 Jun, 2015, at 11:26 pm, Tsai Li Ming mailingl...@ltsai.com wrote: Hi, I have this in my spark-defaults.conf (same for hdfs): spark.eventLog.enabled true spark.eventLog.dir

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Raghav Shankar
The documentation says spark.driver.userClassPathFirst can only be used in cluster mode. Does this mean I have to set the --deploy-mode option for spark-submit to cluster? Or can I still use the default client? My understanding is that even in the default deploy mode, spark still uses the

Submitting Spark Applications using Spark Submit

2015-06-16 Thread raggy
I am trying to submit a spark application using the command line. I used the spark submit command for doing so. I initially setup my Spark application on Eclipse and have been making changes on there. I recently obtained my own version of the Spark source code and added a new method to RDD.scala.

Re: cassandra with jdbcRDD

2015-06-16 Thread Michael Armbrust
I would suggest looking at https://github.com/datastax/spark-cassandra-connector On Tue, Jun 16, 2015 at 4:01 AM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: hi all! is there a way to connect cassandra with jdbcRDD ? -- View this message in context:
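
A minimal sketch of the connector usage, with placeholder keyspace, table and contact point (the connector version has to match the Spark and Cassandra versions in use):

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-read")
  .set("spark.cassandra.connection.host", "cassandra-host")
val sc = new SparkContext(conf)

// cassandraTable comes from the connector's implicits and yields CassandraRow objects.
val rdd = sc.cassandraTable("my_keyspace", "my_table")
rdd.take(10).foreach(println)
```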

Re: How to use DataFrame with MySQL

2015-06-16 Thread matthewrj
I just ran into this too. Thanks for the tip! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-DataFrame-with-MySQL-tp22178p23351.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Unable to use more than 1 executor for spark streaming application with YARN

2015-06-16 Thread Saiph Kappa
Hi, I am running a simple spark streaming application on hadoop 2.7.0/YARN (master: yarn-client) with 2 executors in different machines. However, while the app is running, I can see on the app web UI (tab executors) that only 1 executor keeps completing tasks over time, the other executor only

Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-16 Thread Jia Yu
Hi Peng, I got exactly same error! My shuffle data is also very large. Have you figured out a method to solve that? Thanks, Jia On Fri, Apr 24, 2015 at 7:59 AM, Peng Cheng pc...@uow.edu.au wrote: I'm deploying a Spark data processing job on an EC2 cluster, the job is small for the cluster

Spark Configuration of spark.worker.cleanup.appDataTtl

2015-06-16 Thread luohui20001
Hi guys: I added a parameter spark.worker.cleanup.appDataTtl 3 * 24 * 3600 in my conf/spark-default.conf, then I start my spark cluster. However I got an exception: 15/06/16 14:25:14 INFO util.Utils: Successfully started service 'sparkWorker' on port 43344. 15/06/16 14:25:14 ERROR

Re: Spark Configuration of spark.worker.cleanup.appDataTtl

2015-06-16 Thread Saisai Shao
I think you have to use 604800 instead of 7 * 24 * 3600; obviously SparkConf will not do the multiplication for you. The exception is quite obvious: Caused by: java.lang.NumberFormatException: For input string: 3 * 24 * 3600 2015-06-16 14:52 GMT+08:00 luohui20...@sina.com: Hi guys: I

Spark 1.4 DataFrame Parquet file writing - missing random rows/partitions

2015-06-16 Thread Nathan McCarthy
Hi all, Looks like data frame parquet writing is very broken in Spark 1.4.0. We had no problems with Spark 1.3. When trying to save a data frame with 569610608 rows via dfc.write.format("parquet").save("/data/map_parquet_file"), we get random results between runs. Caching the data frame in memory

Re: How does one decide no of executors/cores/memory allocation?

2015-06-16 Thread Himanshu Mehra
Hi Shreesh, You can definitely decide how many partitions your data should break into by passing a 'minPartitions' argument to the method sc.textFile(input/path, minPartitions) and a 'numSlices' arg to the method sc.parallelize(localCollection, numSlices). In fact there is always an option to specify
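
A sketch of those knobs, assuming an existing SparkContext `sc` and placeholder paths/counts:

```scala
val fromFile  = sc.textFile("hdfs:///input/path", 48)     // minPartitions hint
val fromLocal = sc.parallelize(1 to 1000000, 48)           // numSlices

// Existing RDDs can also be reshaped:
val widened  = fromFile.repartition(96)   // full shuffle to more partitions
val narrowed = fromFile.coalesce(12)      // fewer partitions without a full shuffle
```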

Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-16 Thread Akhil Das
Good question, with fileStream or textFileStream basically it will only take in the files whose timestamp is the current timestamp

Re: tasks won't run on mesos when using fine grained

2015-06-16 Thread Akhil Das
Did you look inside all logs? Mesos logs and executor logs? Thanks Best Regards On Mon, Jun 15, 2015 at 7:09 PM, Gary Ogden gog...@gmail.com wrote: My Mesos cluster has 1.5 CPU and 17GB free. If I set: conf.set(spark.mesos.coarse, true); conf.set(spark.cores.max, 1); in the SparkConf

Spark+hive bucketing

2015-06-16 Thread Marcin Szymaniuk
The Spark SQL document states: Tables with buckets: bucket is the hash partitioning within a Hive table partition. Spark SQL doesn’t support buckets yet. What exactly does that mean? That writing to a bucketed table won't respect this feature and data will be written in a non-bucketed manner?

Re: Optimizing Streaming from Websphere MQ

2015-06-16 Thread Akhil Das
Each receiver will run on 1 core. So if your network is not the bottleneck then to test the consumption speed of the receivers you can simply do a *dstream.count.print* to see how many records it can receive. (Also it will be available in the Streaming tab of the driver UI). If you spawn 10
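
A sketch of running several receivers and unioning them before the rest of the pipeline, with socketTextStream standing in for the custom WebSphere MQ receiver and `ssc` an existing StreamingContext; each receiver occupies one core, so the cluster needs numReceivers cores plus spare cores for processing:

```scala
val numReceivers = 5
val streams = (1 to numReceivers).map(_ => ssc.socketTextStream("mq-bridge-host", 9999))
val unified = ssc.union(streams)

unified.count().print()   // rough per-batch throughput check, as suggested above
```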

Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-16 Thread Akhil Das
You can also look into https://spark.apache.org/docs/latest/tuning.html for performance tuning. Thanks Best Regards On Mon, Jun 15, 2015 at 10:28 PM, Rex X dnsr...@gmail.com wrote: Thanks very much, Akhil. That solved my problem. Best, Rex On Mon, Jun 15, 2015 at 2:16 AM, Akhil Das

Re: settings from props file seem to be ignored in mesos

2015-06-16 Thread Akhil Das
What's in your executor's (that .tgz file) conf/spark-defaults.conf file? Thanks Best Regards On Mon, Jun 15, 2015 at 7:14 PM, Gary Ogden gog...@gmail.com wrote: I'm loading these settings from a properties file: spark.executor.memory=256M spark.cores.max=1 spark.shuffle.consolidateFiles=true

Re: HiveContext saveAsTable create wrong partition

2015-06-16 Thread patcharee
I found if I move the partitioned columns in schemaString and in Row to the end of the sequence, then it works correctly... On 16. juni 2015 11:14, patcharee wrote: Hi, I am using spark 1.4 and HiveContext to append data into a partitioned hive table. I found that the data insert into the

Reply: Re: Spark Configuration of spark.worker.cleanup.appDataTtl

2015-06-16 Thread luohui20001
thanks saisai, I should try a few more times. I thought it would be calculated automatically like the default. Thanks & Best regards! San.Luo - Original Mail - From: Saisai Shao sai.sai.s...@gmail.com To: 罗辉 luohui20...@sina.com Cc: user user@spark.apache.org

SparkR 1.4.0: read.df() function fails

2015-06-16 Thread esten
Hi, In SparkR shell, I invoke: mydf <- read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", header="false") I have tried various filetypes (csv, txt), all fail. RESPONSE: ERROR RBackendHandler: load on 1 failed BELOW THE WHOLE RESPONSE: 15/06/16 08:09:13 INFO MemoryStore:

HiveContext saveAsTable create wrong partition

2015-06-16 Thread patcharee
Hi, I am using spark 1.4 and HiveContext to append data into a partitioned hive table. I found that the data inserted into the table is correct, but the partition (folder) created is totally wrong. Below is my code snippet

cassandra with jdbcRDD

2015-06-16 Thread Hafiz Mujadid
hi all! is there a way to connect cassandra with jdbcRDD ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/cassandra-with-jdbcRDD-tp23335.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark standalone mode and kerberized cluster

2015-06-16 Thread Steve Loughran
On 15 Jun 2015, at 15:43, Borja Garrido Bear kazebo...@gmail.commailto:kazebo...@gmail.com wrote: I tried running the job in a standalone cluster and I'm getting this: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException:

Re: Limit Spark Shuffle Disk Usage

2015-06-16 Thread Himanshu Mehra
Hi Al M, You should try providing more main memory to the shuffle process; it might reduce the spill to disk. The default configuration for the shuffle memory fraction is 20% of the safe memory, which means 16% of the overall heap memory. So when we set executor memory, only a small fraction of it is used in
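
Illustrative settings only (the right values depend on how much caching the job does): raising spark.shuffle.memoryFraction gives shuffles more room before they spill, usually paid for by lowering the storage fraction:

```
# conf/spark-defaults.conf
spark.shuffle.memoryFraction   0.4   # default 0.2
spark.storage.memoryFraction   0.5   # default 0.6
```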

Re: ALS predictALL not completing

2015-06-16 Thread Nick Pentreath
Which version of Spark are you using? On Tue, Jun 16, 2015 at 6:20 AM, afarahat ayman.fara...@yahoo.com wrote: Hello; I have a data set of about 80 Million users and 12,000 items (very sparse ). I can get the training part working no problem. (model has 20 factors), However, when i try