Re: Serialization issue when using HBase with Spark

2014-12-15 Thread Shixiong Zhu
Just pointing out a bug in your code: you should not use `mapPartitions` like that. For details, I recommend the setup() and cleanup() section in Sean Owen's post: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/ Best Regards, Shixiong Zhu 2014-12-14 16:35
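For reference, a minimal sketch of the setup()/cleanup() pattern that post describes: create the expensive, non-serializable resource once per partition inside mapPartitions and close it after the partition has been processed, rather than once per record. The Connection trait and createConnection factory below are hypothetical stand-ins for whatever client the job actually uses (an HBase connection, say).

    import org.apache.spark.rdd.RDD

    // Hypothetical non-serializable resource (e.g. an HBase or JDBC connection).
    trait Connection { def lookup(key: String): String; def close(): Unit }

    def enrich(rdd: RDD[String], createConnection: () => Connection): RDD[String] =
      rdd.mapPartitions { iter =>
        val conn = createConnection()             // "setup()": once per partition, on the executor
        // Materialize the partition so the connection can be closed safely afterwards;
        // the blog post shows a lazier variant that closes on iterator exhaustion instead.
        val out = iter.map(conn.lookup).toVector
        conn.close()                              // "cleanup()"
        out.iterator
      }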

RDD vs Broadcast

2014-12-15 Thread elitejyo
We are developing a Spark framework wherein we are moving historical data into RDD sets. Basically, an RDD is an immutable, read-only dataset on which we do operations. Based on that, we have moved historical data into RDDs and we do computations like filtering/mapping, etc. on such RDDs. Now there is a

Re: Adding a column to a SchemaRDD

2014-12-15 Thread Yanbo Liang
Hi Nathan, #1 Spark SQL DSL can satisfy your requirement. You can refer to the following code snippet: jdata.select(Star(None), 'seven.getField("mod"), 'eleven.getField("mod")) You need to import org.apache.spark.sql.catalyst.analysis.Star in advance. #2 After you make the transform above, you do

Re: Does filter on an RDD scan every data item ?

2014-12-15 Thread nsareen
Thanks! shall try it out. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-tp20170p20683.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Run Spark job on Playframework + Spark Master/Worker in one Mac

2014-12-15 Thread Tomoya Igarashi
Hi all, I am trying to run a Spark job on Playframework + Spark Master/Worker in one Mac. When the job ran, I encountered java.lang.ClassNotFoundException. Could you tell me how to solve it? Here is my code on GitHub. https://github.com/TomoyaIgarashi/spark_cluster_sample * Environments: Mac 10.9.5

Why my SQL UDF cannot be registered?

2014-12-15 Thread Xuelin Cao
Hi, I tried to create a function to convert a Unix timestamp to the hour number in a day. It works if the code is like this: sqlContext.registerFunction("toHour", (x: Long) => { new java.util.Date(x*1000).getHours }) But if I do it like this, it doesn't work: def toHour

Re: Run Spark job on Playframework + Spark Master/Worker in one Mac

2014-12-15 Thread Aniket Bhatnagar
Try the workaround addClassPathJars(sparkContext, this.getClass.getClassLoader) discussed in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAJOb8buD1B6tUtOfG8_Ok7F95C3=r-zqgffoqsqbjdxd427...@mail.gmail.com%3E Thanks, Aniket On Mon Dec 15 2014 at 07:43:24 Tomoya Igarashi

Re: Serialization issue when using HBase with Spark

2014-12-15 Thread Aniket Bhatnagar
"The reason not using sc.newAPIHadoopRDD is it only support one scan each time." I am not sure that's true. You can use multiple scans as follows: val scanStrings = scans.map(scan => convertScanToString(scan)) conf.setStrings(MultiTableInputFormat.SCANS, scanStrings: _*) where
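A rough, untested sketch of the multi-scan setup described above (assumes the HBase 0.98-era MultiTableInputFormat API, a live SparkContext `sc`, and illustrative table names). convertScanToString here is a small helper that Base64-encodes the protobuf form of a Scan, since Hadoop Configuration values are plain strings:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.MultiTableInputFormat
    import org.apache.hadoop.hbase.protobuf.ProtobufUtil
    import org.apache.hadoop.hbase.util.{Base64, Bytes}

    def convertScanToString(scan: Scan): String =
      Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray)

    val conf = HBaseConfiguration.create()
    val scans = Seq("table1", "table2").map { table =>   // hypothetical table names
      val scan = new Scan()
      scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes(table))
      scan
    }
    conf.setStrings(MultiTableInputFormat.SCANS, scans.map(convertScanToString): _*)

    val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[MultiTableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])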

Re: Spark with HBase

2014-12-15 Thread Aniket Bhatnagar
In case you are still looking for help, there have been multiple discussions in this mailing list that you can try searching for. Or you can simply use https://github.com/unicredit/hbase-rdd :-) Thanks, Aniket On Wed Dec 03 2014 at 16:11:47 Ted Yu yuzhih...@gmail.com wrote: Which hbase release

HiveQL support in Cassandra-Spark connector

2014-12-15 Thread shahab
Hi, I just wonder if the Cassandra-Spark connector supports executing HiveQL on Cassandra tables? best, /Shahab

Re: JSON Input files

2014-12-15 Thread Madabhattula Rajesh Kumar
Hi Helena and all, I have found one example of reading a multi-line JSON file into an RDD using https://github.com/alexholmes/json-mapreduce. val data = sc.newAPIHadoopFile( filepath, classOf[MultiLineJsonInputFormat], classOf[LongWritable], classOf[Text], conf ).map(p => (p._1.get,

Re: ...FileNotFoundException: Path is not a file: - error on accessing HDFS with sc.wholeTextFiles

2014-12-15 Thread Karen Murphy
Thanks Akhil, In line with your suggestion I have used the following 2 commands to flatten the directory structure: find . -type f -iname '*' -exec mv '{}' . \; find . -type d -exec rm -rf '{}' \; Kind Regards Karen On 12/12/14 13:25, Akhil Das wrote: I'm not quite sure whether spark

Re: SchemaRDD partition on specific column values?

2014-12-15 Thread Nitin Goyal
Hi Michael, I have opened the following JIRA for the same: https://issues.apache.org/jira/browse/SPARK-4849 I am having a look at the code to see what can be done and then we can have a discussion over the approach. Let me know if you have any comments/suggestions. Thanks -Nitin On Sun, Dec

Migrating Parquet inputs

2014-12-15 Thread Marius Soutier
Hi, is there an easy way to “migrate” parquet files or indicate optional values in sql statements? I added a couple of new fields that I also use in a schemaRDD.sql() which obviously fails for input files that don’t have the new fields. Thanks - Marius

RE: Why my SQL UDF cannot be registered?

2014-12-15 Thread Cheng, Hao
As the error log shows, you may need to register it as: sqlContext.registerFunction("toHour", toHour _) The "_" means you are passing the function as a parameter, not invoking it. Cheng Hao From: Xuelin Cao [mailto:xuelin...@yahoo.com.INVALID] Sent: Monday, December 15, 2014 5:28 PM To: User
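A minimal sketch of both registration forms (Spark 1.x SQLContext API; the UDF body follows the one in the thread). The trailing `_` eta-expands the method into a function value instead of trying to call it:

    // works: pass an anonymous function
    sqlContext.registerFunction("toHour", (x: Long) => new java.util.Date(x * 1000).getHours)

    // works: pass a def, eta-expanded with `_`
    def toHour(x: Long): Int = new java.util.Date(x * 1000).getHours
    sqlContext.registerFunction("toHour", toHour _)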

Re: java.lang.IllegalStateException: unread block data

2014-12-15 Thread Akhil
When you say restored, does it mean the internal/public IPs remained unchanged, or did you change them accordingly? (I'm assuming you are using a cloud service like AWS, GCE or Azure.) What serializer are you using? Try to set the following before creating the sparkContext; it might help

is there a way to interact with Spark clusters remotely?

2014-12-15 Thread Xiaoyong Zhu
Hi experts, I am wondering if there is a way to interact with Spark remotely? i.e. no access to the cluster required, but submit Python/Scala scripts to the cluster and get results based on (REST) APIs. That would facilitate the development process a lot. Xiaoyong

Re: Pagerank implementation

2014-12-15 Thread kmurph
Hiya, I too am looking for a PageRank solution in GraphX where the probabilities sum to 1. I tried a few modifications, including division by the total number of vertices in the first part of the equation, as well as trying to return full rank instead of delta (though not correctly as evident

integrating long-running Spark jobs with Thriftserver

2014-12-15 Thread Tim Schweichler
Hi everybody, I apologize if the answer to my question is obvious but I haven't been able to find a straightforward solution anywhere on the internet. I have a number of Spark jobs written using the python API that do things like read in data from Amazon S3 to a main table in the Hive

Serialize mllib's MatrixFactorizationModel

2014-12-15 Thread Albert Manyà
Hi all. I'm willing to serialize and later load a model trained using mllib's ALS. I've tried using Java serialization with something like: val model = ALS.trainImplicit(training, rank, numIter, lambda, 1) val fos = new FileOutputStream("model.bin") val oos = new

Re: NullPointerException When Reading Avro Sequence Files

2014-12-15 Thread Simone Franzini
To me this looks like an internal error in the REPL. I am not sure what is causing it. Personally I never use the REPL; can you try typing up your program and running it from an IDE or spark-submit and see if you still get the same error? Simone Franzini, PhD

Re: spark kafka batch integration

2014-12-15 Thread Cody Koeninger
For an alternative take on a similar idea, see https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka An advantage of the approach I'm taking is that the lower and upper offsets of the RDD are known in advance, so it's deterministic. I

Intermittent test failures

2014-12-15 Thread Marius Soutier
Hi, I’m seeing strange, random errors when running unit tests for my Spark jobs. In this particular case I’m using Spark SQL to read and write Parquet files, and one error that I keep running into is this one: org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in stage

Re: spark kafka batch integration

2014-12-15 Thread Koert Kuipers
thanks! i will take a look at your code. didn't realize there was already something out there. good point about upper offsets, i will add that feature to our version as well if you don't mind. i was thinking about making it deterministic for task failure transparently (even if no upper offsets

Re: Serialize mllib's MatrixFactorizationModel

2014-12-15 Thread Sean Owen
This class is not going to be serializable, as it contains huge RDDs. Even if the right constructor existed the RDDs inside would not serialize. On Mon, Dec 15, 2014 at 4:33 PM, Albert Manyà alber...@eml.cc wrote: Hi all. I'm willing to serialize and later load a model trained using mllib's

Re: is there a way to interact with Spark clusters remotely?

2014-12-15 Thread Akhil Das
Hi Xiaoyong, You could refer to this post if you are looking at how to run spark jobs remotely http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-td16989.html You will of course require network access to the cluster. Thanks Best

Re: Serialize mllib's MatrixFactorizationModel

2014-12-15 Thread Albert Manyà
In that case, what is the strategy to train a model in some background batch process and make recommendations for some other service in real time? Run both processes in the same spark cluster? Thanks. -- Albert Manyà alber...@eml.cc On Mon, Dec 15, 2014, at 05:58 PM, Sean Owen wrote: This

Re: is there a way to interact with Spark clusters remotely?

2014-12-15 Thread François Le Lay
Have you seen the recent announcement around Spark Kernel using IPython/0MQ protocol ? https://github.com/ibm-et/spark-kernel On Mon, Dec 15, 2014 at 12:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Xiaoyong, You could refer this post if you are looking on how to run spark jobs

Re: MLLIB model export: PMML vs MLLIB serialization

2014-12-15 Thread sourabh
Thanks Vincenzo. Are you trying out all the models implemented in mllib? Actually I don't see decision tree there. Sorry if I missed it. When are you planning to merge this to spark branch? Thanks Sourabh On Sun, Dec 14, 2014 at 5:54 PM, selvinsource [via Apache Spark User List]

Re: Serialize mllib's MatrixFactorizationModel

2014-12-15 Thread sourabh chaki
Hi Albert, There is some discussion going on here: http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-model-export-PMML-vs-MLLIB-serialization-tc20324.html#a20674 I am also looking for this solution. But it looks like until mllib PMML export is ready, there is no foolproof solution to export

Re: Serialize mllib's MatrixFactorizationModel

2014-12-15 Thread Sean Owen
The thing about MatrixFactorizationModel, compared to other models, is that it is huge. It's not just a few coefficients, but whole RDDs of coefficients. I think you could save these RDDs of user/product factors to persistent storage, load them, then recreate the MatrixFactorizationModel that way.
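A sketch of that approach (untested). Note that in Spark 1.1/1.2 the MatrixFactorizationModel constructor is private[mllib], so the rebuild helper below assumes it is compiled into that package; alternatively, compute predictions directly from the factor RDDs yourself.

    package org.apache.spark.mllib.recommendation

    import org.apache.spark.SparkContext

    object ALSModelIO {
      // Persist the two factor RDDs; they are essentially the only state the model carries.
      def save(model: MatrixFactorizationModel, dir: String): Unit = {
        model.userFeatures.saveAsObjectFile(s"$dir/userFeatures")
        model.productFeatures.saveAsObjectFile(s"$dir/productFeatures")
      }

      // Reload the factors and rebuild the model around them.
      def load(sc: SparkContext, rank: Int, dir: String): MatrixFactorizationModel =
        new MatrixFactorizationModel(
          rank,
          sc.objectFile[(Int, Array[Double])](s"$dir/userFeatures"),
          sc.objectFile[(Int, Array[Double])](s"$dir/productFeatures"))
    }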

Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI

2014-12-15 Thread shenghua
Hello, I met a problem when using the Spark SQL CLI. A custom UDTF with lateral view throws a ClassNotFound exception. I did a couple of experiments in the same environment (Spark version 1.1.1): select + same custom UDTF (passed); select + lateral view + custom UDTF (ClassNotFoundException); select +

Re: custom spark app name in yarn-cluster mode

2014-12-15 Thread Tomer Benyamini
Thanks Sandy, passing --name works fine :) Tomer On Fri, Dec 12, 2014 at 9:35 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Tomer, In yarn-cluster mode, the application has already been submitted to YARN by the time the SparkContext is created, so it's too late to set the app name

Accessing rows of a row in Spark

2014-12-15 Thread Jerry Lam
Hi spark users, Do you know how to access the rows of a row? I have a SchemaRDD called user and registered it as a table with the following schema: root |-- user_id: string (nullable = true) |-- item: array (nullable = true) | |-- element: struct (containsNull = false) | | |-- item_id:

Re: JSON Input files

2014-12-15 Thread Michael Armbrust
Underneath the covers, jsonFile uses TextInputFormat, which will split files correctly based on new lines. Thus, there is no fixed maximum size for a json object (other than the fact that it must fit into memory on the executors). On Mon, Dec 15, 2014 at 7:22 AM, Madabhattula Rajesh Kumar

Re: Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI

2014-12-15 Thread Michael Armbrust
Can you add this information to the JIRA? On Mon, Dec 15, 2014 at 10:54 AM, shenghua wansheng...@gmail.com wrote: Hello, I met a problem when using Spark sql CLI. A custom UDTF with lateral view throws ClassNotFound exception. I did a couple of experiments in same environment (spark version

Re: Accessing rows of a row in Spark

2014-12-15 Thread Mark Hamstra
scala> val items = Row(1 -> "orange", 2 -> "apple") items: org.apache.spark.sql.catalyst.expressions.Row = [(1,orange),(2,apple)] If you literally want an iterator, then this: scala> items.toIterator.count { case (user_id, name) => user_id == 1 } res0: Int = 1 ...else: scala> items.count { case

Re: Intermittent test failures

2014-12-15 Thread Michael Armbrust
Is it possible that you are starting more than one SparkContext in a single JVM with out stopping previous ones? I'd try testing with Spark 1.2, which will throw an exception in this case. On Mon, Dec 15, 2014 at 8:48 AM, Marius Soutier mps@gmail.com wrote: Hi, I’m seeing strange, random

Re: Spark metrics for ganglia

2014-12-15 Thread danilopds
Thanks tsingfu, I used this configuration based in your post: (with ganglia unicast mode) # Enable GangliaSink for all instances *.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink *.sink.ganglia.host=10.0.0.7 *.sink.ganglia.port=8649 *.sink.ganglia.period=15

Re: printing mllib.linalg.vector

2014-12-15 Thread Xiangrui Meng
you can use the default toString method to get the string representation. if you want to customized, check the indices/values fields. -Xiangrui On Fri, Dec 5, 2014 at 7:32 AM, debbie debbielarso...@hotmail.com wrote: Basic question: What is the best way to loop through one of these and print

Re: Accessing rows of a row in Spark

2014-12-15 Thread Jerry Lam
Hi Mark, Thank you for helping out. The items I got back from Spark SQL has the type information as follows: scala> items res16: org.apache.spark.sql.Row = [WrappedArray([1,orange],[2,apple])] I tried to iterate the items as you suggested but no luck. Best Regards, Jerry On Mon, Dec 15,

Re: MLlib(Logistic Regression) + Spark Streaming.

2014-12-15 Thread Xiangrui Meng
If you want to train offline and predict online, you can use the current LR implementation to train a model and then apply model.predict on the dstream. -Xiangrui On Sun, Dec 7, 2014 at 6:30 PM, Nasir Khan nasirkhan.onl...@gmail.com wrote: I am new to spark. Lets say i want to develop a machine

Re: MLLIb: Linear regression: Loss was due to java.lang.ArrayIndexOutOfBoundsException

2014-12-15 Thread Xiangrui Meng
Is it possible that after filtering the feature dimension changed? This may happen if you use LIBSVM format but didn't specify the number of features. -Xiangrui On Tue, Dec 9, 2014 at 4:54 AM, Sameer Tilak ssti...@live.com wrote: Hi All, I was able to run LinearRegressionwithSGD for a larger
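A small sketch of the fix hinted at above: pass the feature dimension explicitly when loading LIBSVM data, so a filtered subset that happens to lack the highest-index feature still produces vectors of the full width (the path and dimension below are placeholders):

    import org.apache.spark.mllib.util.MLUtils

    val numFeatures = 10000  // assumed true dimensionality of the full data set
    val training = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm", numFeatures)
    val filtered = training.filter(_.label >= 0)   // filtering no longer changes the dimension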

Re: Why KMeans with mllib is so slow ?

2014-12-15 Thread Xiangrui Meng
Please check the number of partitions after sc.textFile. Use sc.textFile('...', 8) to have at least 8 partitions. -Xiangrui On Tue, Dec 9, 2014 at 4:58 AM, DB Tsai dbt...@dbtsai.com wrote: You just need to use the latest master code without any configuration to get performance improvement from
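For example (a sketch; the path and partition count are placeholders):

    val points = sc.textFile("hdfs:///data/points.csv", 8)   // ask for at least 8 partitions up front
    // or, for an already-loaded RDD:
    val repartitioned = points.repartition(8)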

Re: Stack overflow Error while executing spark SQL

2014-12-15 Thread Xiangrui Meng
Could you post the full stacktrace? It seems to be some recursive call in parsing. -Xiangrui On Tue, Dec 9, 2014 at 7:44 PM, jishnu.prat...@wipro.com wrote: Hi I am getting Stack overflow Error Exception in main java.lang.stackoverflowerror

Re: Accessing rows of a row in Spark

2014-12-15 Thread Mark Hamstra
Looks like you've got one more layer of containment than you intend -- i.e. you've got Row[WrappedArray[Row[(Int, String)]]] where you want Row[Row[(Int, String)]]. That's easy to do if somewhere along the line you did something like `val row = Row(collection)` instead of `val row =
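A tiny sketch of the difference being described (Spark 1.1/1.2 catalyst Row; the sample data is illustrative):

    import org.apache.spark.sql.catalyst.expressions.Row

    val collection = Seq(1 -> "orange", 2 -> "apple")
    val nested = Row(collection)        // one element: the whole WrappedArray -- the extra layer
    val flat   = Row(collection: _*)    // two elements: the pairs themselves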

Re: what is the best way to implement mini batches?

2014-12-15 Thread Imran Rashid
I'm a little confused by some of the responses. It seems like there are two different issues being discussed here: 1. How to turn a sequential algorithm into something that works on spark. Eg deal with the fact that data is split into partitions which are processed in parallel (though within a

Re: Building Desktop application for ALS-MlLib/ Training ALS

2014-12-15 Thread Xiangrui Meng
On Sun, Dec 14, 2014 at 3:06 AM, Saurabh Agrawal saurabh.agra...@markit.com wrote: Hi, I am a new bee in spark and scala world I have been trying to implement Collaborative filtering using MlLib supplied out of the box with Spark and Scala I have 2 problems 1. The best

Re: ALS failure with size Integer.MAX_VALUE

2014-12-15 Thread Xiangrui Meng
Unfortunately, it will depend on the Sorter API in 1.2. -Xiangrui On Mon, Dec 15, 2014 at 11:48 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi Xiangrui, The block size limit was encountered even with reduced number of item blocks as you had expected. I'm wondering if I could try the

Re: Building Desktop application for ALS-MlLib/ Training ALS

2014-12-15 Thread Abhi Basu
In case you must write c# code, you can call python code from c# or use IronPython. :) On Mon, Dec 15, 2014 at 12:04 PM, Xiangrui Meng men...@gmail.com wrote: On Sun, Dec 14, 2014 at 3:06 AM, Saurabh Agrawal saurabh.agra...@markit.com wrote: Hi, I am a new bee in spark and scala

Re: java.lang.IllegalStateException: unread block data

2014-12-15 Thread Morbious
"Restored" meant rebooting the slave node with its IP unchanged. The funny thing is that spark works fine for small files. I also checked Hadoop with HDFS and I'm able to run wordcount on it without any problems (i.e. a file about 50GB in size). -- View this message in context:

Re: ERROR YarnClientClusterScheduler: Lost executor Akka client disassociated

2014-12-15 Thread DB Tsai
Hi Muhammad, Maybe next time you can use http://pastebin.com/ to format and paste the cleaner scala code snippet so other can help you easier. Also, please only paste the significant portion of stack-trace which causes the issue instead of giant logs. First of all, In your log, it seems that you

Re: Including data nucleus tools

2014-12-15 Thread DB Tsai
Just out of my curiosity. Do you manually apply this patch and see if this can actually resolve the issue? It seems that it was merged at some point, but reverted due to that it causes some stability issue. Sincerely, DB Tsai --- My Blog:

Re: MLLIB model export: PMML vs MLLIB serialization

2014-12-15 Thread selvinsource
I am going to try to export decision tree next, so far I focused on linear models and k-means. Regards, Vincenzo sourabh wrote Thanks Vincenzo. Are you trying out all the models implemented in mllib? Actually I don't see decision tree there. Sorry if I missed it. When are you planning to

Re: pyspark is crashing in this case. why?

2014-12-15 Thread Sameer Farooqui
Adding the group back. FYI Genesis - this was on an m3.xlarge with all default settings in Spark. I used Spark version 1.3.0. The 2nd case did work for me: a = [1,2,3,4,5,6,7,8,9] b = [] for x in range(100): ... b.append(a) ... rdd1 = sc.parallelize(b) rdd1.first() 14/12/15 16:33:01 WARN

Stop streaming context gracefully when SIGTERM is passed

2014-12-15 Thread Budde, Adam
Hi all, We are using Spark Streaming to ETL a large volume of time series datasets. In our current design, each dataset we ETL will have a corresponding Spark Streaming context + process running on our cluster. Each of these processes will be passed configuration options specifying the data

Re: Intermittent test failures

2014-12-15 Thread Marius Soutier
Ok, maybe these test versions will help me then. I’ll check it out. On 15.12.2014, at 22:33, Michael Armbrust mich...@databricks.com wrote: Using a single SparkContext should not cause this problem. In the SQL tests we use TestSQLContext and TestHive which are global singletons for all of

NumberFormatException

2014-12-15 Thread yu
Hello, everyone. I know a 'NumberFormatException' occurs when a String cannot be parsed properly, but I really cannot find any mistakes in my code. I hope someone may kindly help me. My hdfs file is as follows: 8,22 3,11 40,10 49,47 48,29 24,28 50,30 33,56 4,20 30,38 ... So each line

Re: Why KMeans with mllib is so slow ?

2014-12-15 Thread Jaonary Rabarisoa
I've tried some additional experiments with kmeans and I finally got it to work as I expected. In fact, the number of partitions is critical. I had a data set of 24x784 with 12 partitions. In this case the kmeans algorithm took a very long time (hours) to converge. When I change the

Re: Stop streaming context gracefully when SIGTERM is passed

2014-12-15 Thread Soumitra Kumar
Hi Adam, I have the following scala actor based code to do graceful shutdown: class TimerActor (val timeout : Long, val who : Actor) extends Actor { def act { reactWithin (timeout) { case TIMEOUT => who ! SHUTDOWN } } } class SSCReactor (val ssc :
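A simpler variant of the same idea, as a sketch: register a JVM shutdown hook so a SIGTERM stops the StreamingContext gracefully (assumes an in-scope `ssc`):

    sys.addShutdownHook {
      // wait for in-flight batches to finish, then stop the underlying SparkContext too
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }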

Re: NumberFormatException

2014-12-15 Thread Sean Owen
That certainly looks surprising. Are you sure there are no unprintable characters in the file? On Mon, Dec 15, 2014 at 9:49 PM, yu yuz1...@iastate.edu wrote: The exception info is: 14/12/15 15:35:03 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, h3):

Re: Can spark job have sideeffects (write files to FileSystem)

2014-12-15 Thread Paweł Szulc
Yes, this is what I also found in the Spark documentation, that foreach can have side effects. Nevertheless I have this weird error that sometimes files are just empty. `using` is simply a wrapper that takes our code, wraps it in try-catch-finally and flushes/closes all resources. I honestly have no clue

Re: Can spark job have sideeffects (write files to FileSystem)

2014-12-15 Thread Davies Liu
Considering that any task could be launched concurrently on different nodes, in order to make sure the generated files are valid, you need some atomic operation (such as rename) to do it. For example, you could generate a random name for the output file, write the data into it, rename it to
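A sketch of that write-then-rename pattern on HDFS (the helper name and byte-array payload are illustrative):

    import java.util.UUID
    import org.apache.hadoop.fs.{FileSystem, Path}

    def writeAtomically(fs: FileSystem, target: Path, bytes: Array[Byte]): Unit = {
      val tmp = new Path(target.getParent, s".${target.getName}.${UUID.randomUUID()}.tmp")
      val out = fs.create(tmp)
      try out.write(bytes) finally out.close()
      // rename is atomic on HDFS; if another attempt already wrote the target, clean up our temp file
      if (!fs.rename(tmp, target)) fs.delete(tmp, false)
    }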

Re: what is the best way to implement mini batches?

2014-12-15 Thread Earthson Lu
Hi Imran, you are right. Sequential processing does not make sense with Spark. I think sequential processing works if the batch for each iteration is large enough (this batch could be processed in parallel). My point is that we shall not run mini-batches in parallel, but it is still possible to use

Re: NotSerializableException in Spark Streaming

2014-12-15 Thread Nicholas Chammas
This still seems to be broken. In 1.1.1, it errors immediately on this line (from the above repro script): liveTweets.map(t => noop(t)).print() The stack trace is: org.apache.spark.SparkException: Task not serializable at

Re: NumberFormatException

2014-12-15 Thread Harihar Nahak
Hi Yu, Try this: val data = csv.map( line => line.split(",").map(elem => elem.trim)) //lines in rows data.map( rec => (rec(0).toInt, rec(1).toInt)) to convert into integers. On 16 December 2014 at 10:49, yu [via Apache Spark User List] ml-node+s1001560n20694...@n3.nabble.com wrote: Hello,

Re: JSON Input files

2014-12-15 Thread Madabhattula Rajesh Kumar
Thank you Peter for the clarification. Regards, Rajesh On Tue, Dec 16, 2014 at 12:42 AM, Michael Armbrust mich...@databricks.com wrote: Underneath the covers, jsonFile uses TextInputFormat, which will split files correctly based on new lines. Thus, there is no fixed maximum size for a json

Re: Executor memory

2014-12-15 Thread sandy . ryza
Hi Pala, Spark executors only reserve spark.storage.memoryFraction (default 0.6) of their spark.executor.memory for caching RDDs. The spark UI displays this fraction. spark.executor.memory controls the executor heap size. spark.yarn.executor.memoryOverhead controls the extra that's tacked on
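For reference, a sketch of the three settings mentioned above (Spark 1.x property names; the values are just examples):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "4g")                // executor JVM heap
      .set("spark.storage.memoryFraction", "0.6")        // share of the heap reserved for cached RDDs
      .set("spark.yarn.executor.memoryOverhead", "512")  // extra MB YARN adds on top of the heap for the container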

Fetch Failed caused job failed.

2014-12-15 Thread Mars Max
While I was running a Spark MR job, there was FetchFailed(BlockManagerId(47, xx.com, 40975, 0), shuffleId=2, mapId=5, reduceId=286), then there were many retries, and the job finally failed. The log showed the following error; has anybody met this error, or is it a known issue in

Re: ALS failure with size Integer.MAX_VALUE

2014-12-15 Thread Bharath Ravi Kumar
Ok. We'll try using it in a test cluster running 1.2. On 16-Dec-2014 1:36 am, Xiangrui Meng men...@gmail.com wrote: Unfortunately, it will depends on the Sorter API in 1.2. -Xiangrui On Mon, Dec 15, 2014 at 11:48 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi Xiangrui, The block size

Fwd: Run Spark job on Playframework + Spark Master/Worker in one Mac

2014-12-15 Thread Tomoya Igarashi
Hi Aniket, Thanks for your reply. I followed your advice and modified my code. Here is the latest one. https://github.com/TomoyaIgarashi/spark_cluster_sample/commit/ce7613c42d3adbe6ae44e264c11f3829460f3c35 As a result, it works correctly! Thank you very much. But an AssociationError message appears

Can I set max execution time for any task in a job?

2014-12-15 Thread Mohamed Lrhazi
Is that possible, and if not, how would one do it from PySpark? This probably does not make sense in most cases, but I am writing a script where my job involves downloading and pushing data into cassandra.. sometimes a task hangs forever, and I don't really mind killing it.. The job is not actually

RE: is there a way to interact with Spark clusters remotely?

2014-12-15 Thread Xiaoyong Zhu
Thanks all for your information! What Pietro mentioned seems to be the appropriate solution. I also found a slide deck talking about it: http://www.slideshare.net/EvanChan2/spark-summit-2014-spark-job-server-talk Several quick questions: 1. Is it already available in the Spark main branch? (seems

Re: Run Spark job on Playframework + Spark Master/Worker in one Mac

2014-12-15 Thread Aniket Bhatnagar
Seems you are using standalone mode. Can you check spark worker logs or application logs in spark work directory to find any errors? On Tue, Dec 16, 2014, 9:09 AM Tomoya Igarashi tomoya.igarashi.0...@gmail.com wrote: Hi Aniket, Thanks for your reply. I followed your advice to modified my

Re: Can I set max execution time for any task in a job?

2014-12-15 Thread Akhil Das
There is a spark listener interface https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.scheduler.SparkListener which can be used to get notified of events like job started, task got results, etc., but I don't think you can set a max execution time anywhere. If a task is hung, it's mostly because
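One related knob worth noting (not a hard timeout, and not something the thread itself settles on): speculative execution re-launches unusually slow tasks on other executors. A sketch of the relevant Spark 1.x settings:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.speculation", "true")
      .set("spark.speculation.multiplier", "4")  // how many times slower than the median before relaunching
    // note: the original slow/hung task is not killed; the job simply uses whichever attempt finishes first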

Re: Fetch Failed caused job failed.

2014-12-15 Thread Akhil Das
You could try setting the following while creating the sparkContext: .set("spark.rdd.compress", "true") .set("spark.storage.memoryFraction", "1") .set("spark.core.connection.ack.wait.timeout", "600") .set("spark.akka.frameSize", "50") Thanks Best Regards On Tue, Dec 16, 2014 at 8:30 AM, Mars Max

Re: NumberFormatException

2014-12-15 Thread Akhil Das
There could be some other character like a space or ^M etc. You could try the following and see the actual row: val newstream = datastream.map(row => { try { val strArray = row.trim().split(",") (strArray(0).toInt, strArray(1).toInt) //Instead try this
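A fuller sketch along the same lines, dropping (and logging) unparseable rows instead of failing the task:

    val parsed = datastream.flatMap { row =>
      val cols = row.trim.split(",")
      try {
        Some((cols(0).trim.toInt, cols(1).trim.toInt))
      } catch {
        case e: NumberFormatException =>
          System.err.println(s"Skipping bad row: '$row'")  // inspect later for stray characters
          None
      }
    }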

Re: Run Spark job on Playframework + Spark Master/Worker in one Mac

2014-12-15 Thread Tomoya Igarashi
Thanks for response. Yes, I am using standalone mode. I couldn't find any errors. But, WARN messages appear in Spark master logs. Here is Spark master logs. https://gist.github.com/TomoyaIgarashi/72145c11d3769c7d1ddb FYI Here is Spark worker logs.

Accessing Apache Spark from Java

2014-12-15 Thread Jai
Hi, I have installed a standalone Spark setup on a Linux server and I am trying to access that Spark setup from Java on Windows. When I try connecting to Spark I see the following exception 14/12/16 12:52:52 WARN TaskSchedulerImpl: Initial job has not accepted any resources;