Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread N B
Hi Jon, In Spark streaming, 1 batch = 1 RDD. Essentially, the terms are used interchangeably. If you are trying to collect multiple batches across a DStream into a single RDD, look at the window() operations. Hope this helps Nikunj On Wed, Jul 15, 2015 at 7:00 PM, Jon Chase jon.ch...@gmail.com
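
As a rough illustration of the window() suggestion (a Scala sketch with assumed names and a 5-second batch interval; not code from the thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("window-sketch"), Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical source

    // Every 30 seconds, window() hands us a single RDD covering the last 6 batches.
    lines.window(Seconds(30), Seconds(30)).foreachRDD { rdd =>
      // process the combined RDD here
      println(s"window contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()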

Re: HiBench test for hadoop/hive/spark cluster

2015-07-15 Thread Ted Yu
From log file: 15/07/16 11:16:56 INFO mapred.LocalDistributedCacheManager: Creating symlink: /tmp/hadoop-root/mapred/local/1437016615898/user_agents - /opt/HiBench-master/user_agents 15/07/16 11:16:56 INFO mapred.LocalDistributedCacheManager: Localized

Re: Java 8 vs Scala

2015-07-15 Thread spark user
I struggled a lot with Scala, almost 10 days with no improvement, but when I switched to Java 8 things were smooth, and I used Data Frames with Redshift and Hive and all looked good. If you are very good in Scala then go with Scala, otherwise Java is the best fit. This is just my opinion because I am

Reply: Re: HiBench test for hadoop/hive/spark cluster

2015-07-15 Thread luohui20001
Hi Ted, thanks for your advice. I found that there is something wrong with the hadoop fs -get command, because I believe the localization of hdfs://spark-study:9000/HiBench/Aggregation/temp/user_agents to /tmp/hadoop-root/mapred/local/1437016615898/user_agents is a behaviour like hadoop fs -get

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
I should note that the amount of data in each batch is very small, so I'm not concerned with performance implications of grouping into a single RDD. On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase jon.ch...@gmail.com wrote: I'm currently doing something like this in my Spark Streaming program

RE: HiveThriftServer2.startWithContext error with registerTempTable

2015-07-15 Thread Cheng, Hao
Have you ever tried querying “select * from temp_table” from the spark shell? Or can you try the option --jars while starting the spark shell? From: Srikanth [mailto:srikanth...@gmail.com] Sent: Thursday, July 16, 2015 9:36 AM To: user Subject: Re: HiveThriftServer2.startWithContext error with

Re: Spark 1.4.0 compute-classpath.sh

2015-07-15 Thread Lokesh Kumar Padhnavis
Thanks a lot :) On Wed, Jul 15, 2015 at 11:48 PM Marcelo Vanzin van...@cloudera.com wrote: That has never been the correct way to set your app's classpath. Instead, look at http://spark.apache.org/docs/latest/configuration.html and search for extraClassPath. On Wed, Jul 15, 2015 at 9:43 AM,

RE: [SparkR] creating dataframe from json file

2015-07-15 Thread Sun, Rui
You can try selectExpr() of DataFrame. For example, y <- selectExpr(df, "concat(hashtags.text[0],hashtags.text[1])") # the [] operator is used to extract an item from an array, or sql(hiveContext, "select concat(hashtags.text[0],hashtags.text[1]) from table") Yeah, the documentation of SparkR is not

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Ted Yu
Looks like this method should serve Jon's needs: def reduceByWindow( reduceFunc: (T, T) => T, windowDuration: Duration, slideDuration: Duration On Wed, Jul 15, 2015 at 8:23 PM, N B nb.nos...@gmail.com wrote: Hi Jon, In Spark streaming, 1 batch = 1 RDD. Essentially, the
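
A hedged usage sketch of reduceByWindow (the DStream of per-record values and the window sizes are assumptions):

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // Fold all elements seen in the last 60 seconds into one value per window.
    def windowTotals(events: DStream[Long]): DStream[Long] =
      events.reduceByWindow(_ + _, Seconds(60), Seconds(60))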

Running foreach on a list of rdds in parallel

2015-07-15 Thread Brandon White
Hello, I have a list of rdds List(rdd1, rdd2, rdd3, rdd4) and I would like to save these rdds in parallel. Right now, it is running each operation sequentially. I tried using an rdd of rdds but that does not work. list.foreach { rdd => rdd.saveAsTextFile("/tmp/cache/") } Any ideas?
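
One possible workaround, not taken from the thread: submit each save as its own job from a separate thread, and give every RDD its own output directory, since saveAsTextFile cannot write to an existing path. A minimal Scala sketch, with paths and data assumed:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("parallel-save"))
    val rdds = List(sc.parallelize(1 to 100), sc.parallelize(101 to 200))

    // Each Future submits an independent Spark job; the scheduler can run them concurrently.
    val saves = rdds.zipWithIndex.map { case (rdd, i) =>
      Future { rdd.saveAsTextFile(s"/tmp/cache/rdd-$i") }
    }
    Await.result(Future.sequence(saves), Duration.Inf)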

Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
I'm currently doing something like this in my Spark Streaming program (Java): dStream.foreachRDD((rdd, batchTime) -> { log.info("processing RDD from batch {}", batchTime); // my rdd processing code }); Instead of having my

Re: fileStream with old files

2015-07-15 Thread Terry Hole
Hi Hunter, what behavior do you see with HDFS? The local file system and HDFS should have the same behavior. Thanks! - Terry Hunter Morgan hunter.mor...@rackspace.com wrote on Thursday, July 16, 2015 at 2:04 AM: After moving the setting of the parameter to SparkConf initialization instead of

RE: Python DataFrames: length of ArrayType

2015-07-15 Thread Cheng, Hao
Actually it's supposed to be part of Spark 1.5 release, see https://issues.apache.org/jira/browse/SPARK-8230 You're definitely welcome to contribute to it, let me know if you have any question on implementing it. Cheng Hao -Original Message- From: pedro

Re: HiveThriftServer2.startWithContext error with registerTempTable

2015-07-15 Thread Srikanth
Hello, Re-sending this to see if I'm second time lucky! I've not managed to move past this error. Srikanth On Mon, Jul 13, 2015 at 9:14 PM, Srikanth srikanth...@gmail.com wrote: Hello, I want to expose result of Spark computation to external tools. I plan to do this with Thrift server JDBC

Re: Random Forest Error

2015-07-15 Thread rishikesh
Thanks, that fixed the problem. Cheers Rishi -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Random-Forest-Error-tp23847p23850.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: creating a distributed index

2015-07-15 Thread Ankur Dave
The latest version of IndexedRDD supports any key type with a defined serializer https://github.com/amplab/spark-indexedrdd/blob/master/src/main/scala/edu/berkeley/cs/amplab/spark/indexedrdd/KeySerializer.scala, including Strings. It's not released yet, but you can use it from the master branch if

DataFrame InsertIntoJdbc() Runtime Exception on cluster

2015-07-15 Thread Manohar753
Hi All, Am trying to add few new rows for existing table in mysql using DataFrame.But it is adding new rows to the table in local environment but on spark cluster below is the runtime exception. Exception in thread main java.lang.RuntimeException: Table msusers_1 already exists. at

Re: spark sql - group by constant column

2015-07-15 Thread Lior Chaga
I found out the problem. Grouping by a constant column value is indeed impossible. The reason it was working in my project is that I gave the constant column an alias that exists in the schema of the dataframe. The dataframe contained a data_timestamp representing an hour, and I added to the
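
Since the literal is the same for every row, one hedged workaround (illustrative SQL only, reusing the column names from the original post and assuming an existing sqlContext with tbl registered) is to select the constant without listing it in the GROUP BY:

    val result = sqlContext.sql(
      """SELECT primary_one, primary_two, 10 AS num, SUM(measure) AS total_measures
        |FROM tbl
        |GROUP BY primary_one, primary_two""".stripMargin)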

[SparkR] creating dataframe from json file

2015-07-15 Thread jianshu
hi all, Not sure whether this the right venue to ask. If not, please point me to the right group, if there is any. I'm trying to create a Spark DataFrame from JSON file using jsonFile(). The call was successful, and I can see the DataFrame created. The JSON file I have contains a number of

Re: creating a distributed index

2015-07-15 Thread Jem Tucker
With regards to Indexed structures in Spark are there any alternatives to IndexedRDD for more generic keys including Strings? Thanks Jem On Wed, Jul 15, 2015 at 7:41 AM Burak Yavuz brk...@gmail.com wrote: Hi Swetha, IndexedRDD is available as a package on Spark Packages

Re: Java 8 vs Scala

2015-07-15 Thread Reinis Vicups
We have a complex application that has been running productively for a couple of months and heavily uses Spark in Scala. Just to give you some insight into the complexity - we do not have such a huge amount of source data (only about 500'000 complex elements), but we have more than a billion transformations and

Re: creating a distributed index

2015-07-15 Thread Jem Tucker
This is very interesting, do you know if this version will be backwards compatible with older versions of Spark (1.2.0)? Thanks, Jem On Wed, Jul 15, 2015 at 10:04 AM Ankur Dave ankurd...@gmail.com wrote: The latest version of IndexedRDD supports any key type with a defined serializer

what is metadata in StructField ?

2015-07-15 Thread matd
I see in StructField that we can provide metadata. What is it meant for? How is it used by Spark later on? Are there any rules on what we can/cannot do with it? I'm building some DataFrame processing, and I need to maintain a set of (meta)data along with the DF. I was wondering if I can use

Re: Java 8 vs Scala

2015-07-15 Thread Gourav Sengupta
Why would you create a class and then instantiate it to store data and change the class every time you have to add a new element? In OOPS terminology a class represents an object, and an object has states - does it not? Purely from a data warehousing perspective - one of the fundamental

Re: Random Forest Error

2015-07-15 Thread Anas Sherwani
For RandomForest classifier, labels should be within the range [0,numClasses-1]. This means, you have to map your labels to 0,1 instead of 1,2. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Random-Forest-Error-tp23847p23848.html Sent from the Apache Spark
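
A small sketch of that remapping in Scala MLlib; the input RDD and the remaining training parameters are placeholders, not the original poster's code:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.rdd.RDD

    val data: RDD[LabeledPoint] = ???            // labels currently in {1.0, 2.0}
    // Shift labels down by one so they fall in [0, numClasses-1], i.e. {0.0, 1.0}.
    val remapped = data.map(lp => LabeledPoint(lp.label - 1.0, lp.features))

    val model = RandomForest.trainClassifier(
      remapped, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 3, featureSubsetStrategy = "auto", impurity = "gini",
      maxDepth = 4, maxBins = 32)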

Re: Strange behavior of CoalescedRDD

2015-07-15 Thread Konstantin Knizhnik
Looks like the source of the problem is in the SqlNewHadoopRDD.compute method: the created Parquet file reader is intended to be closed at task completion time. This reader contains a lot of references to parquet.bytes.BytesInput objects which in turn contain references to large byte arrays (some of

Spark Stream suitability

2015-07-15 Thread polariz
Hi, I am evaluating my options for a project that ingests a rich data feed, does some aggregate calculations and allows the user to query on these. The (protobuf) data feed is rich in the sense that it contains several data fields which can be used to calculate several different KPI figures.

Strange behavior of pyspark with --jars option

2015-07-15 Thread gen tang
Hi, I met some interesting problems with the --jars option. As I use the third-party dependency elasticsearch-spark, I pass this jar with the following command: ./bin/spark-submit --jars path-to-dependencies ... It works well. However, if I use HiveContext.sql, spark will lose the dependencies

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-15 Thread Deepak Jain
Leftouterjoin and join apis are super slow in spark. 100x slower than hadoop Sent from my iPhone On 14-Jul-2015, at 10:59 PM, Wush Wu wush...@gmail.com wrote: I don't understand. By the way, the `joinWithCassandraTable` does improve my query time from 40 mins to 3 mins. 2015-07-15

Re: Strange behavior of pyspark with --jars option

2015-07-15 Thread Burak Yavuz
Hi, I believe the HiveContext uses a different class loader. It then falls back to the system class loader if it can't find the classes in the context class loader. The system class loader contains the classpath passed through --driver-class-path and spark.executor.extraClassPath. The JVM is

Re: Spark on EMR with S3 example (Python)

2015-07-15 Thread Akhil Das
I think any request going to s3*:// requires the credentials. If they have made it public (via http) then you won't require the keys. Thanks Best Regards On Wed, Jul 15, 2015 at 2:26 AM, Pagliari, Roberto rpagli...@appcomsci.com wrote: Hi Sujit, I just wanted to access public datasets on

spark sql - group by constant column

2015-07-15 Thread Lior Chaga
Hi, Facing a bug with group by in SparkSQL (version 1.4). Registered a JavaRDD with object containing integer fields as a table. Then I'm trying to do a group by, with a constant value in the group by fields: SELECT primary_one, primary_two, 10 as num, SUM(measure) as total_measures FROM tbl

Re: MLlib LogisticRegressionWithLBFGS error

2015-07-15 Thread Burak Yavuz
Hi, Is this in LibSVM format? If so, the indices should be sorted in increasing order. It seems like they are not sorted. Best, Burak On Tue, Jul 14, 2015 at 7:31 PM, Vi Ngo Van ngovi.se@gmail.com wrote: Hi All, I've met a issue with MLlib when i use LogisticRegressionWithLBFGS my

Re: Research ideas using spark

2015-07-15 Thread Vineel Yalamarthy
Hi Daniel Well said Regards Vineel On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos daniel.dara...@lynxanalytics.com wrote: Hi Shahid, To be honest I think this question is better suited for Stack Overflow than for a PhD thesis. On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf

Re: Java 8 vs Scala

2015-07-15 Thread 诺铁
I think different team got different answer for this question. my team use scala, and happy with it. On Wed, Jul 15, 2015 at 1:31 PM, Tristan Blakers tris...@blackfrog.org wrote: We have had excellent results operating on RDDs using Java 8 with Lambdas. It’s slightly more verbose than Scala,

Re: MLlib LogisticRegressionWithLBFGS error

2015-07-15 Thread Vi Ngo Van
This is LibSVM format; I can use this data with the libsvm library. In this sample, they are not sorted. I will sort them and try again. Thank you, On Wed, Jul 15, 2015 at 1:47 PM, Burak Yavuz brk...@gmail.com wrote: Hi, Is this in LibSVM format? If so, the indices should be sorted in

Re: Research ideas using spark

2015-07-15 Thread Akhil Das
Try to repartition it to a higher number (at least 3-4 times the total # of cpu cores). What operation are you doing? It may happen that if you are doing a join/groupBy sort of operation, the task that is taking so long holds all the values; in that case you need to use a Partitioner which will
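
A rough Scala sketch of this suggestion, with the core count and the keyed RDD assumed:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    val totalCores = 16                               // assumed cluster-wide core count
    val pairs: RDD[(String, Int)] = ???               // whatever keyed data is being aggregated

    // 3-4x the core count gives smaller, more evenly spread shuffle tasks.
    val result = pairs.reduceByKey(new HashPartitioner(totalCores * 4), _ + _)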

Re: Problem in Understanding concept of Physical Cores

2015-07-15 Thread Aniruddh Sharma
Hi TD, I request your guidance on the 5 queries below. The following is the context in which I would evaluate them based on your response. a) I need to decide whether to deploy Spark in Standalone mode or on Yarn. But it seems to me that Spark on Yarn is more parallel than Standalone mode (given

Random Forest Error

2015-07-15 Thread rishikesh
Hi, I am trying to train a Random Forest over my dataset. I have a binary classification problem. When I call the train method as below model = RandomForest.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, numTrees=3, featureSubsetStrategy='auto', impurity='gini', maxDepth=4,

Re: creating a distributed index

2015-07-15 Thread Burak Yavuz
Hi Swetha, IndexedRDD is available as a package on Spark Packages http://spark-packages.org/package/amplab/spark-indexedrdd. Best, Burak On Tue, Jul 14, 2015 at 5:23 PM, swetha swethakasire...@gmail.com wrote: Hi Ankur, Is IndexedRDD available in Spark 1.4.0? We would like to use this in

Re: Java 8 vs Scala

2015-07-15 Thread Ignacio Blasco
The main advantage of using scala vs java 8 is being able to use a console 2015-07-15 9:27 GMT+02:00 诺铁 noty...@gmail.com: I think different team got different answer for this question. my team use scala, and happy with it. On Wed, Jul 15, 2015 at 1:31 PM, Tristan Blakers

Re: Sessionization using updateStateByKey

2015-07-15 Thread algermissen1971
On 15 Jul 2015, at 17:38, Cody Koeninger c...@koeninger.org wrote: An in-memory hash key data structure of some kind so that you're close to linear on the number of items in a batch, not the number of outstanding keys. That's more complex, because you have to deal with expiration for keys

RE: fileStream with old files

2015-07-15 Thread Hunter Morgan
After moving the setting of the parameter to SparkConf initialization instead of after the context is already initialized, I have it operating reliably on local filesystem, but not on hdfs. Are there any differences in behavior between these two cases I should be aware of? I don’t usually

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-15 Thread Wush Wu
Dear Sujit, Thanks for your suggestion. After testing, `joinWithCassandraTable` does the trick as you mentioned. rdd2 now only queries the rows which have the same key as rdd1. Best, Wush 2015-07-16 0:00 GMT+08:00 Sujit Pal sujitatgt...@gmail.com: Hi Wush, One option may be to

Re: Strange Error: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-07-15 Thread Saeed Shahrivari
Yes there is. But the RDD is more than 10 TB and compression does not help. On Wed, Jul 15, 2015 at 8:36 PM, Ted Yu yuzhih...@gmail.com wrote: bq. serializeUncompressed() Is there a method which enables compression ? Just wondering if that would reduce the memory footprint. Cheers On

Re: Sessionization using updateStateByKey

2015-07-15 Thread Silvio Fiorito
Hi Cody, I’ve had success using updateStateByKey for real-time sessionization by aging off timed-out sessions (returning None in the update function). This was on a large commercial website with millions of hits per day. This was over a year ago so I don’t have access to the stats any longer
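
A rough sketch of the "return None to age a session off" pattern described here (Scala; the Session type, the timeout, and the input DStream are assumptions, and checkpointing must be enabled for updateStateByKey):

    import org.apache.spark.streaming.dstream.DStream

    case class Session(lastSeenMs: Long, hits: Long)
    val sessionTimeoutMs = 30 * 60 * 1000L

    def updateSession(events: Seq[Long], state: Option[Session]): Option[Session] = {
      val now = System.currentTimeMillis
      state match {
        case Some(s) if events.isEmpty && now - s.lastSeenMs > sessionTimeoutMs =>
          None                                      // timed out: drop the key from the state
        case _ if events.isEmpty =>
          state                                     // nothing new this batch, keep waiting
        case _ =>
          val prev = state.getOrElse(Session(now, 0L))
          Some(Session(events.max, prev.hits + events.size))
      }
    }

    // hits: DStream[(String, Long)] of (sessionKey, eventTimeMs), assumed upstream
    def sessions(hits: DStream[(String, Long)]): DStream[(String, Session)] =
      hits.updateStateByKey(updateSession _)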

Re: Research ideas using spark

2015-07-15 Thread shahid ashraf
Sorry Guys! I mistakenly added my question to this thread( Research ideas using spark). Moreover people can ask any question , this spark user group is for that. Cheers!  On Wed, Jul 15, 2015 at 9:43 PM, Robin East robin.e...@xense.co.uk wrote: Well said Will. I would add that you might want

Re: Sessionization using updateStateByKey

2015-07-15 Thread Sean McNamara
I would just like to add that we do the very same/similar thing at Webtrends, updateStateByKey has been a life-saver for our sessionization use-cases. Cheers, Sean On Jul 15, 2015, at 11:20 AM, Silvio Fiorito silvio.fior...@granturing.com wrote: Hi Cody,

Info from the event timeline appears to contradict dstat info

2015-07-15 Thread Tom Hubregtsen
I am trying to analyze my program, in particular to see what the bottleneck is (IO, CPU, network), and started using the event timeline for this. When looking at my Job 0, Stage 0 (the sampler function taking up 5.6 minutes of my 40 minute program), I see in the event timeline that all time is

Re: Spark 1.4.0 compute-classpath.sh

2015-07-15 Thread Marcelo Vanzin
That has never been the correct way to set your app's classpath. Instead, look at http://spark.apache.org/docs/latest/configuration.html and search for extraClassPath. On Wed, Jul 15, 2015 at 9:43 AM, lokeshkumar lok...@dataken.net wrote: Hi forum I have downloaded the latest spark version

Re: Research ideas using spark

2015-07-15 Thread Ravindra
Look at this : http://www.forbes.com/sites/lisabrownlee/2015/07/10/the-11-trillion-internet-of-things-big-data-and-pattern-of-life-pol-analytics/ On Wed, Jul 15, 2015 at 10:19 PM shahid ashraf sha...@trialx.com wrote: Sorry Guys! I mistakenly added my question to this thread( Research ideas

Spark job returns a different result on each run

2015-07-15 Thread sbvarre
I am working on Scala code which performs Linear Regression on certain datasets. Right now I am using 20 cores and 25 executors, and every time I run a Spark job I get a different result. The input sizes of the files are 2 GB and 400 MB. However, when I run the job with 20 cores and 1 executor, I

Re: Sessionization using updateStateByKey

2015-07-15 Thread Cody Koeninger
Don't get me wrong, we've been able to use updateStateByKey for some jobs, and it's certainly convenient. At a certain point though, iterating through every key on every batch is a less viable solution. On Wed, Jul 15, 2015 at 12:38 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: I

spark streaming job to hbase write

2015-07-15 Thread Shushant Arora
Hi, I have a requirement to write to an HBase table from a Spark streaming app after some processing. Is the HBase put operation the only way of writing to HBase, or is there any specialised connector or RDD in Spark for HBase writes? Should bulk load to HBase from a streaming app be avoided if the output of

Re: Java 8 vs Scala

2015-07-15 Thread Alan Burlison
On 15/07/2015 08:31, Ignacio Blasco wrote: The main advantage of using scala vs java 8 is being able to use a console https://bugs.openjdk.java.net/browse/JDK-8043364 -- Alan Burlison

Re: what is metadata in StructField ?

2015-07-15 Thread Peter Rudenko
Hi Mathieu, metadata is very useful if you need to save some data about a column (e.g. count of null values, cardinality, domain, min/max/std, etc.). It's currently used in the ml package in attributes:
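
For reference, a small Scala sketch of attaching and reading column metadata; the stored keys are arbitrary examples, not a Spark convention:

    import org.apache.spark.sql.types._

    val stats = new MetadataBuilder()
      .putLong("nullCount", 0L)
      .putDouble("max", 100.0)
      .build()

    val schema = StructType(Seq(
      StructField("score", DoubleType, nullable = false, metadata = stats)))

    // The metadata travels with the schema and can be read back later.
    println(schema("score").metadata.getDouble("max"))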

updateStateByKey schedule time

2015-07-15 Thread Michel Hubert
Hi, I want to implement a time-out mechanism in the updateStateByKey(...) routine. But is there a way to retrieve the time of the start of the batch corresponding to the call to my updateStateByKey routines? Suppose the streaming has built up some delay; then a System.currentTimeMillis() will

DataFrame.write().partitionBy(some_column).parquet(path) produces OutOfMemory with very few items

2015-07-15 Thread Nikos Viorres
Hi, I am trying to test partitioning for DataFrames with parquet usage, so I attempted to do df.write().partitionBy(some_column).parquet(path) on a small dataset of 20,000 records which, when saved as parquet locally with gzip, takes 4 MB of disk space. However, on my dev machine with

Spark and HDFS

2015-07-15 Thread Jeskanen, Elina
I have Spark 1.4 on my local machine and I would like to connect to our local 4-node Cloudera cluster. But how? In the example it says text_file = spark.textFile(hdfs://...), but can you advise me on where to get this hdfs://... address? Thanks! Elina

Any beginner samples for using ML / MLIB to produce a moving average of a (K, iterable[V])

2015-07-15 Thread Nkechi Achara
Hi all, I am trying to get some summary statistics to retrieve the moving average for several devices that have an array of latencies in seconds in this kind of format: deviceLatencyMap = [K:String, Iterable[V: Double]] I understand that there is a MultivariateSummary, but as this is a trait,

RE: [SparkR] creating dataframe from json file

2015-07-15 Thread Sun, Rui
Suppose df <- jsonFile(sqlContext, "json file"). You can extract hashtags.text as a Column object using the following command: t <- getField(df$hashtags, "text") and then you can perform operations on the column. You can extract hashtags.text as a DataFrame using the following command: t <-

Re: Research ideas using spark

2015-07-15 Thread Michael Segel
Silly question… When thinking about a PhD thesis… do you want to tie it to a specific technology, or do you want to investigate an idea and then use a specific technology? Or is this an outdated way of thinking? I am doing my PhD thesis on large scale machine learning e.g. online learning,

Re: Java 8 vs Scala

2015-07-15 Thread vaquar khan
My choice is java 8 On 15 Jul 2015 18:03, Alan Burlison alan.burli...@oracle.com wrote: On 15/07/2015 08:31, Ignacio Blasco wrote: The main advantage of using scala vs java 8 is being able to use a console https://bugs.openjdk.java.net/browse/JDK-8043364 -- Alan Burlison --

Re: Research ideas using spark

2015-07-15 Thread vaquar khan
I would suggest studying Spark, Flink, and Storm, and based on your understanding and findings preparing your research paper. Maybe you will invent a new Spark ☺ Regards, Vaquar khan On 16 Jul 2015 00:47, Michael Segel msegel_had...@hotmail.com wrote: Silly question… When thinking about a PhD thesis…

NotSerializableException in spark 1.4.0

2015-07-15 Thread Chen Song
The streaming job has been running ok in 1.2 and 1.3. After I upgraded to 1.4, I started seeing error as below. It appears that it fails in validate method in StreamingContext. Is there anything changed on 1.4.0 w.r.t DStream checkpointint? Detailed error from driver: 15/07/15 18:00:39 ERROR

Re: Info from the event timeline appears to contradict dstat info

2015-07-15 Thread Tom Hubregtsen
I think I found my answer at https://github.com/kayousterhout/trace-analysis: One thing to keep in mind is that Spark does not currently include instrumentation to measure the time spent reading input data from disk or writing job output to disk (the 'Output write wait' shown in the waterfall is

Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Kelly, Jonathan
Would there be any problem in having spark.executor.instances (or --num-executors) be completely ignored (i.e., even for non-zero values) if spark.dynamicAllocation.enabled is true (i.e., rather than throwing an exception)? I can see how the exception would be helpful if, say, you tried to

Re: Spark Intro

2015-07-15 Thread vaquar khan
Totally agreed with Hafsa: you need to identify your requirements and needs before choosing Spark. If you want to handle data with fast access, go with NoSQL (Mongo, Aerospike etc.); if you need data analytics then Spark is best. Regards, Vaquar khan On 14 Jul 2015 20:39, Hafsa Asif

small accumulator gives out of memory error

2015-07-15 Thread AlexG
When I call the following minimal working example, the accumulator matrix is 32-by-100K, and each executor has 64G but I get an out of memory error: Exception in thread main java.lang.OutOfMemoryError: Requested array size exceeds VM limit Here BDM is a Breeze DenseMatrix object

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Ted Yu
Can you show us your function(s) ? Thanks On Wed, Jul 15, 2015 at 12:46 PM, Chen Song chen.song...@gmail.com wrote: The streaming job has been running ok in 1.2 and 1.3. After I upgraded to 1.4, I started seeing error as below. It appears that it fails in validate method in StreamingContext.

get java.io.FileNotFoundException when use addFile Function

2015-07-15 Thread prateek arora
I am trying to write a simple program using the addFile function but I am getting an error in my worker node that the file does not exist. Stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, slave2.novalocal): java.io.FileNotFoundException: File

get java.io.FileNotFoundException when use addFile Function

2015-07-15 Thread prateek arora
Hi, I am trying to write a simple program using the addFile function but I am getting an error in my worker node that the file does not exist. Stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, slave2.novalocal): java.io.FileNotFoundException: File

Python DataFrames, length of array

2015-07-15 Thread pedro
Based on the list of functions here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions there doesn't seem to be a way to get the length of an array in a dataframe without defining a UDF. What I would be looking for is something like this (except

Re: Research ideas using spark

2015-07-15 Thread Jörn Franke
Well one of the strength of spark is standardized general distributed processing allowing many different types of processing, such as graph processing, stream processing etc. The limitation is that it is less performant than one system focusing only on one type of processing (eg graph processing).

Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Kelly, Jonathan
bump From: Jonathan Kelly jonat...@amazon.com Date: Tuesday, July 14, 2015 at 4:23 PM To: user@spark.apache.org Subject: Unable to use dynamicAllocation if spark.executor.instances is set in

Re: spark streaming job to hbase write

2015-07-15 Thread Todd Nist
There are there connector packages listed on spark packages web site: http://spark-packages.org/?q=hbase HTH. -Todd On Wed, Jul 15, 2015 at 2:46 PM, Shushant Arora shushantaror...@gmail.com wrote: Hi I have a requirement of writing in hbase table from Spark streaming app after some

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Tathagata Das
Your streaming job may have been seemingly running ok, but the DStream checkpointing must have been failing in the background. It would have been visible in the log4j logs. In 1.4.0, we enabled fast failure for that so that checkpointing failures don't get hidden in the background. The fact that

Re: ALS run method versus ALS train versus ALS fit and transform

2015-07-15 Thread Sean Owen
The first two examples are from the .mllib API. Really, the new ALS()...run() form is underneath both of the first two. In the second case, you're calling a convenience method that calls something similar to the first example. The second example is from the new .ml pipelines API. Similar ideas,
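
To make the distinction concrete, a hedged sketch of the two .mllib entry points; the parameter values and the ratings RDD are placeholders:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    val ratings: RDD[Rating] = ???

    // 1) builder style: configure an ALS instance, then run()
    val model1 = new ALS().setRank(10).setIterations(10).setLambda(0.01).run(ratings)

    // 2) convenience method that wraps the same machinery
    val model2 = ALS.train(ratings, 10, 10, 0.01)

    // The .ml pipelines API is the third form: an ALS estimator whose fit()
    // takes a DataFrame and returns a model exposing transform().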

Re: Java 8 vs Scala

2015-07-15 Thread Ted Yu
jshell is nice but it is targeting Java 9 Cheers On Wed, Jul 15, 2015 at 5:31 AM, Alan Burlison alan.burli...@oracle.com wrote: On 15/07/2015 08:31, Ignacio Blasco wrote: The main advantage of using scala vs java 8 is being able to use a console

ALS run method versus ALS train versus ALS fit and transform

2015-07-15 Thread Carol McDonald
In the Spark mllib example MovieLensALS.scala, ALS run is used; however, in the movie recommendation with mllib tutorial, ALS train is used. What is the difference, and when should you use one versus the other? val model = new ALS() .setRank(params.rank)

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Chen Song
Ah, cool. Thanks. On Wed, Jul 15, 2015 at 5:58 PM, Tathagata Das t...@databricks.com wrote: Spark 1.4.1 just got released! So just download that. Yay for timing. On Wed, Jul 15, 2015 at 2:47 PM, Ted Yu yuzhih...@gmail.com wrote: Should be this one: [SPARK-7180] [SPARK-8090]

Re: Java 8 vs Scala

2015-07-15 Thread Alan Burlison
On 15/07/2015 21:17, Ted Yu wrote: jshell is nice but it is targeting Java 9 Yes I know, just pointing out that eventually Java would have a REPL as well. -- Alan Burlison

Spark cluster read local files

2015-07-15 Thread Julien Beaudan
Hi all, Is it possible to use Spark to assign each machine in a cluster the same task, but on files in each machine's local file system, and then have the results sent back to the driver program? Thank you in advance! Julien

Python DataFrames: length of ArrayType

2015-07-15 Thread pedro
Resubmitting after fixing subscription to mailing list. Based on the list of functions here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions there doesn't seem to be a way to get the length of an array in a dataframe without defining a UDF. What I

Re: Spark and HDFS

2015-07-15 Thread Marcelo Vanzin
On Wed, Jul 15, 2015 at 5:36 AM, Jeskanen, Elina elina.jeska...@cgi.com wrote: I have Spark 1.4 on my local machine and I would like to connect to our local 4 nodes Cloudera cluster. But how? In the example it says text_file = spark.textFile(hdfs://...), but can you advise me in where to

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Chen Song
Thanks. Can you point me to the patch that fixes the serialization stack? Maybe I can pull it in and rerun my job. Chen On Wed, Jul 15, 2015 at 4:40 PM, Tathagata Das t...@databricks.com wrote: Your streaming job may have been seemingly running ok, but the DStream checkpointing must have been

Announcing Spark 1.4.1!

2015-07-15 Thread Patrick Wendell
Hi All, I'm happy to announce the Spark 1.4.1 maintenance release. We recommend all users on the 1.4 branch upgrade to this release, which contains several important bug fixes. Download Spark 1.4.1 - http://spark.apache.org/downloads.html Release notes -

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Tathagata Das
Spark 1.4.1 just got released! So just download that. Yay for timing. On Wed, Jul 15, 2015 at 2:47 PM, Ted Yu yuzhih...@gmail.com wrote: Should be this one: [SPARK-7180] [SPARK-8090] [SPARK-8091] Fix a number of SerializationDebugger bugs and limitations ... Closes #6625 from

Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Sandy Ryza
Hi Jonathan, This is a problem that has come up for us as well, because we'd like dynamic allocation to be turned on by default in some setups, but not break existing users with these properties. I'm hoping to figure out a way to reconcile these by Spark 1.5. -Sandy On Wed, Jul 15, 2015 at

RE: Any beginner samples for using ML / MLIB to produce a moving average of a (K, iterable[V])

2015-07-15 Thread Mohammed Guller
I could be wrong, but it looks like the only implementation available right now is MultivariateOnlineSummarizer. Mohammed From: Nkechi Achara [mailto:nkach...@googlemail.com] Sent: Wednesday, July 15, 2015 4:31 AM To: user@spark.apache.org Subject: Any beginner samples for using ML / MLIB to
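
If a full MLlib summarizer turns out not to be needed, a plain-Scala sketch of a per-device moving average could look like this (window size, names, and the pair RDD are assumptions):

    import org.apache.spark.rdd.RDD

    val deviceLatency: RDD[(String, Iterable[Double])] = ???
    val window = 5

    // For each device, average every consecutive window of 5 latency samples.
    val movingAverages: RDD[(String, Seq[Double])] = deviceLatency.mapValues { latencies =>
      latencies.toSeq.sliding(window).map(w => w.sum / w.size).toSeq
    }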

Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Andrew Or
Yeah, we could make it log a warning instead. 2015-07-15 14:29 GMT-07:00 Kelly, Jonathan jonat...@amazon.com: Thanks! Is there an existing JIRA I should watch? ~ Jonathan From: Sandy Ryza sandy.r...@cloudera.com Date: Wednesday, July 15, 2015 at 2:27 PM To: Jonathan Kelly

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Ted Yu
Should be this one: [SPARK-7180] [SPARK-8090] [SPARK-8091] Fix a number of SerializationDebugger bugs and limitations ... Closes #6625 from tdas/SPARK-7180 and squashes the following commits: On Wed, Jul 15, 2015 at 2:37 PM, Chen Song chen.song...@gmail.com wrote: Thanks Can you point

Re: spark-submit can not resolve spark-hive_2.10

2015-07-15 Thread Hao Ren
Thanks for the reply. Actually, I don't think excluding spark-hive from spark-submit --packages is a good idea. I don't want to recompile spark by assembly for my cluster, every time a new spark release is out. I prefer using binary version of spark and then adding some jars for job execution.

Re: Sessionization using updateStateByKey

2015-07-15 Thread algermissen1971
Hi Cody, oh ... I thought that was one of *the* use cases for it. Do you have a suggestion / best practice for how to achieve the same thing with better scaling characteristics? Jan On 15 Jul 2015, at 15:33, Cody Koeninger c...@koeninger.org wrote: I personally would try to avoid

java heap error

2015-07-15 Thread AlexG
I'm trying to compute the Frobenius norm error in approximating an IndexedRowMatrix A with the product L*R where L and R are Breeze DenseMatrices. I've written the following function that computes the squared error over each partition of rows then sums up to get the total squared error (ignore

Job aborted due to stage failure: Task not serializable:

2015-07-15 Thread Naveen Dabas
I am using the below code with the Kryo serializer. When I run this code I get this error: Task not serializable (in the commented line). 2) How are broadcast variables treated in the executor? Are they local variables, or can they be used in any function defined as global variables? object

Re: Spark on EMR with S3 example (Python)

2015-07-15 Thread Sujit Pal
Hi Roberto, I think you would need to, as Akhil said. I just checked from this page: http://aws.amazon.com/public-data-sets/ and, clicking through to a few dataset links, all of them are available on s3 (some are available via http and ftp, but I think the point of these datasets is that they are

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-15 Thread Sujit Pal
Hi Wush, One option may be to try a replicated join. Since your rdd1 is small, read it into a collection and broadcast it to the workers, then filter your larger rdd2 against the collection on the workers. -sujit On Tue, Jul 14, 2015 at 11:33 PM, Deepak Jain deepuj...@gmail.com wrote:
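
A minimal Scala sketch of that replicated-join idea, with key and value types assumed and sc being the SparkContext:

    import org.apache.spark.rdd.RDD

    val small: RDD[(String, Int)] = ???          // the small rdd1
    val large: RDD[(String, String)] = ???       // the large Cassandra-backed rdd2

    val smallMap = sc.broadcast(small.collectAsMap())

    // Map-side "join": keep only rows of the large RDD whose key exists in the small one.
    val joined = large.flatMap { case (k, v) =>
      smallMap.value.get(k).map(s => (k, (v, s)))
    }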

out of memory error in treeAggregate

2015-07-15 Thread AlexG
I'm using the following function to compute B*A where B is a 32-by-8mil Breeze DenseMatrix and A is a 8mil-by-100K IndexedRowMatrix. // computes BA where B is a local matrix and A is distributed: let b_i denote the // ith col of B and a_i denote the ith row of A, then BA = sum(b_i a_i) def

Re: Strange Error: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-07-15 Thread Ted Yu
bq. serializeUncompressed() Is there a method which enables compression ? Just wondering if that would reduce the memory footprint. Cheers On Wed, Jul 15, 2015 at 8:06 AM, Saeed Shahrivari saeed.shahriv...@gmail.com wrote: I use a simple map/reduce step in a Java/Spark program to remove
