Re: adding rows to a DataFrame

2016-03-11 Thread Bijay Pathak
Here is another way you can achieve that (in Python): base_df.withColumn("column_name", "column_expression_for_new_column") # To add a new row, create a DataFrame containing the new row and do unionAll(): base_df.unionAll(new_df) # Another approach: convert to RDD, add required fields and convert
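A minimal Scala sketch of the same two approaches, spark-shell style (a SparkContext sc is assumed; the columns id/name are illustrative only):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.upper

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val baseDf = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    // Add a derived column from an expression.
    val withExtraCol = baseDf.withColumn("name_upper", upper($"name"))

    // Add new rows: build a DataFrame with the same schema and unionAll it.
    val newRows = Seq((3, "carol")).toDF("id", "name")
    val combined = baseDf.unionAll(newRows)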

Re: udf StructField to JSON String

2016-03-11 Thread Tristan Nixon
It’s pretty simple, really: import com.fasterxml.jackson.databind.ObjectMapper import org.apache.spark.ml.UnaryTransformer import org.apache.spark.ml.util.Identifiable import org.apache.spark.sql.types.{DataType, StringType} /** * A SparkML Transformer that will transform an * entity of type T
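A hedged sketch of what a transformer built from those imports could look like; the class body below is an assumption reconstructed from the quoted imports, not the poster's actual code:

    import com.fasterxml.jackson.databind.ObjectMapper
    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{DataType, StringType}

    // Serializes each value of the input column to a JSON string. Jackson may
    // need extra modules for Scala types; that setup is omitted here.
    class JsonSerializationTransformer[T](override val uid: String)
        extends UnaryTransformer[T, String, JsonSerializationTransformer[T]] {

      def this() = this(Identifiable.randomUID("jsonser"))

      @transient private lazy val mapper = new ObjectMapper()

      override protected def createTransformFunc: T => String =
        (in: T) => mapper.writeValueAsString(in)

      override protected def outputDataType: DataType = StringType
    }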

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Tristan Nixon
You must be relying on IntelliJ to compile your Scala, because you haven't set up any Scala plugin to compile it from Maven. You should have something like this in your plugins: net.alchim31.maven : scala-maven-plugin, with an execution (id scala-compile-first, phase process-resources, goal compile)

Re: udf StructField to JSON String

2016-03-11 Thread Tristan Nixon
So I think in your case you'd do something more like: val jsontrans = new JsonSerializationTransformer[StructType].setInputCol("event").setOutputCol("eventJSON") > On Mar 11, 2016, at 3:51 PM, Tristan Nixon wrote: > > val jsontrans = new >

Re: How to distribute dependent files (.so , jar ) across spark worker nodes

2016-03-11 Thread Tristan Nixon
I recommend you package all your dependencies (jars, .so's, etc.) into a single uber-jar and then submit that. It's much more convenient than trying to manage including everything in the --jars arg of spark-submit. If you build with maven then the shade plugin will do this for you nicely:

YARN process with Spark

2016-03-11 Thread Mich Talebzadeh
Hi, Can these be clarified please: 1. Can a YARN container use more than one core, and is this configurable? 2. A YARN container is constrained to 8GB by "yarn.scheduler.maximum-allocation-mb". If a YARN container is a Spark process, will that limit also include the memory Spark

Re: YARN process with Spark

2016-03-11 Thread Alexander Pivovarov
Forgot to mention. To avoid unnecessary container termination, add the following setting to YARN: yarn.nodemanager.vmem-check-enabled = false

Re: adding rows to a DataFrame

2016-03-11 Thread Jan Štěrba
It very much depends on the logic that generates the new rows. Is it per-row (i.e. without context)? Then you can just convert to RDD and perform a map operation on each row. JavaPairRDD grouped = dataFrame.javaRDD().groupBy( group by what you need, probably ID ); return
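In Scala, the per-row variant without context might look like the sketch below; the expansion rule here just duplicates each row, and the names dataFrame and sqlContext are placeholders:

    import org.apache.spark.sql.Row

    val schema = dataFrame.schema
    val expanded = dataFrame.rdd.flatMap { row =>
      // hypothetical rule: emit the original row plus one derived copy
      Seq(row, Row.fromSeq(row.toSeq))
    }
    val result = sqlContext.createDataFrame(expanded, schema)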

Re: YARN process with Spark

2016-03-11 Thread Mich Talebzadeh
Thanks Koert and Alexander. I think the YARN configuration parameters in yarn-site.xml are important. For those I have: yarn.nodemanager.resource.memory-mb = 8192 (maximum amount of physical memory, in MB, that can be allocated for YARN containers); yarn.nodemanager.vmem-pmem-ratio (ratio

Re: YARN process with Spark

2016-03-11 Thread Alexander Pivovarov
YARN cores are virtual cores which are used just to calculate available resources, but usually memory is used to manage YARN resources (not cores). Spark executor memory should be ~90% of yarn.scheduler.maximum-allocation-mb (which should be the same as yarn.nodemanager.resource.memory-mb); ~10%

Re: YARN process with Spark

2016-03-11 Thread Alexander Pivovarov
You need to set yarn.scheduler.minimum-allocation-mb=32, otherwise the Spark AM container will run on a dedicated box instead of running together with an executor container on one of the boxes. For slaves I use the Amazon EC2 r3.2xlarge box (61GB / 8 cores) - cost ~$0.10 / hour (spot instance)

Re: YARN process with Spark

2016-03-11 Thread Koert Kuipers
You get a Spark executor per YARN container. The Spark executor can have multiple cores, yes; this is configurable. So the number of partitions that can be processed in parallel is num-executors * executor-cores, and for processing a partition the available memory is executor-memory /
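A hedged SparkConf sketch of how these settings fit together; the figures are illustrative, not recommendations:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("yarn-sizing-example")
      .set("spark.executor.instances", "4")  // one executor per YARN container
      .set("spark.executor.cores", "4")      // tasks run in parallel per executor
      .set("spark.executor.memory", "6g")    // must fit, plus overhead, under
                                             // yarn.scheduler.maximum-allocation-mb
    // Cluster-wide parallelism = 4 instances * 4 cores = 16 tasks; memory
    // available per task is roughly executor-memory / executor-cores.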

Repeating Records w/ Spark + Avro?

2016-03-11 Thread Chris Miller
I have a bit of a strange situation: import org.apache.avro.generic.{GenericData, GenericRecord} import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper, AvroKey} import org.apache.avro.mapreduce.AvroKeyInputFormat import org.apache.hadoop.io.{NullWritable, WritableUtils}

Strange behavior of collectNeighbors API in GraphX

2016-03-11 Thread Zhaokang Wang
Hi all, These days I have met a problem with strange behavior of GraphX's collectNeighbors API. It seems that this API has side-effects on the Pregel API: it makes the Pregel API not work as expected. The following is a small code demo to reproduce this strange behavior. You can get the whole

Re: Can we use spark inside a web service?

2016-03-11 Thread Hemant Bhanawat
Spark-jobserver is an elegant product that builds concurrency on top of Spark. But the current design of DAGScheduler prevents Spark from becoming a truly concurrent solution for low-latency queries. DAGScheduler will turn out to be a bottleneck for low-latency queries. The Sparrow project was an effort

How to efficiently query a large table with multiple dimensional table?

2016-03-11 Thread ashokkumar rajendran
Hi All, I have a large table with a few billion rows and a very small table with 4 dimension values. I would like to get rows that match any of these dimensions. For example: Select field1, field2 from A, B where A.dimension1 = B.dimension1 OR A.dimension2 = B.dimension2 OR A.dimension3
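OR conditions in a join predicate generally force a slow plan, so one common rewrite (an assumption, not advice from this thread) is a union of separate equi-joins against the small, broadcastable table; a is the large DataFrame and b the small one, both assumed:

    import org.apache.spark.sql.functions.broadcast

    val j1 = a.join(broadcast(b), a("dimension1") === b("dimension1"))
    val j2 = a.join(broadcast(b), a("dimension2") === b("dimension2"))
    val j3 = a.join(broadcast(b), a("dimension3") === b("dimension3"))

    // Union the per-dimension matches and dedupe rows matched more than once.
    val matched = j1.unionAll(j2).unionAll(j3)
      .select(a("field1"), a("field2"))
      .distinct()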

Re: Installing Spark on Mac

2016-03-11 Thread Jakob Odersky
Some more diagnostics/suggestions: 1) are other services listening to ports in the 4000 range (run "netstat -plunt")? Maybe there is an issue with the error message itself. 2) are you sure the correct java version is used? java -version 3) can you revert all installation attempts you have done

Re: Zeppelin Integration

2016-03-11 Thread Mich Talebzadeh
BTW, when the daemon is stopped on the host, the notebook just hangs if it was running, without any errors. The only way is to tail the last log in $ZEPPELIN_HOME/logs. So I would say a cron type job is required to scan the log for errors. Dr Mich Talebzadeh LinkedIn *

Re: Running ALS on comparitively large RDD

2016-03-11 Thread Deepak Gopalakrishnan
Executor memory: 45g x 4 executors, 1 driver with 45g memory. The data source is S3 and I have logs that tell me the Rating objects are loaded fine. On Fri, Mar 11, 2016 at 2:13 PM, Nick Pentreath wrote: > Hmmm, something else is going on there. What data source are

kill Spark Streaming job gracefully

2016-03-11 Thread Shams ul Haque
Hi, I want to kill a Spark Streaming job gracefully, so that whatever Spark has picked up from Kafka gets processed. My Spark version is 1.6.0. Killing a Spark Streaming job from the Spark UI doesn't stop the app completely: in the Spark UI the job is moved to the COMPLETED section, but in the log it
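For reference, a sketch of the standard graceful-shutdown options; ssc stands in for the running StreamingContext, and the shutdown trigger itself (a marker file, an endpoint, etc.) is up to the application:

    // Finish the batches already received before stopping.
    ssc.stop(stopSparkContext = true, stopGracefully = true)

    // Or let the JVM shutdown hook stop gracefully on SIGTERM:
    // spark-submit --conf spark.streaming.stopGracefullyOnShutdown=true ...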

Re: Installing Spark on Mac

2016-03-11 Thread Jakob Odersky
regarding my previous message, I forgot to mention to run netstat as root (sudo netstat -plunt) sorry for the noise On Fri, Mar 11, 2016 at 12:29 AM, Jakob Odersky wrote: > Some more diagnostics/suggestions: > > 1) are other services listening to ports in the 4000 range (run >

can checkpoint and write ahead log save the data in queued batch?

2016-03-11 Thread Yu Xie
Hi spark users, I am running a Spark Streaming app that uses a receiver for a pub/sub system, and the pub/sub system does NOT support ack. I don't want the data to be lost if there is a driver failure and, by accident, the batches queue up at that time. I tested by generating some queued
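The standard recipe for driver-failure recovery is checkpointing plus the receiver write-ahead log; treat the sketch below as a mechanism outline under assumed names and paths rather than a guarantee for an un-ack'd source:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/myapp"  // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf()
        .setAppName("queued-batch-recovery")
        // log received blocks durably so they can be replayed after driver failure
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... build the receiver stream and transformations here ...
      ssc
    }

    // On restart, recover queued-batch metadata and data from the checkpoint/WAL.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)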

Re: Running ALS on comparitively large RDD

2016-03-11 Thread Nick Pentreath
Hmmm, something else is going on there. What data source are you reading from? How much driver and executor memory have you provided to Spark? On Fri, 11 Mar 2016 at 09:21 Deepak Gopalakrishnan wrote: > 1. I'm using about 1 million users against few thousand products. I >

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Nick Pentreath
Would you mind letting us know the # of training examples in the datasets? Also, what do your features look like? Are they text, categorical etc? You mention that most rows only have a few features, and all rows together have a few tens of thousands of features, yet your max feature value is 20 million. How are

Re: Spark configuration with 5 nodes

2016-03-11 Thread Steve Loughran
On 10 Mar 2016, at 22:15, Ashok Kumar wrote: Hi, We intend to use 5 servers which will be utilized for building a Bigdata Hadoop data warehouse system (not using any proprietary distribution like Hortonworks or Cloudera or

Spark Streaming: java.lang.NoClassDefFoundError: org/apache/kafka/common/message/KafkaLZ4BlockOutputStream

2016-03-11 Thread Siva
Hi Everyone, All of a sudden we are encountering the below error from one of the spark consumers. It used to work without any issues. When I restart the consumer with the latest offsets, it works fine for some time (it executes a few batches) and then fails again; this issue is intermittent.

Spark with Yarn Client

2016-03-11 Thread Divya Gehlot
Hi, I am trying to understand the behaviour/configuration of Spark with YARN client mode on a Hadoop cluster. Can somebody help me or point me to documents/blogs/books which give a deeper understanding of the above two? Thanks, Divya

Spark session dies in about 2 days: HDFS_DELEGATION_TOKEN token can't be found

2016-03-11 Thread Ruslan Dautkhanov
The Spark session dies out after ~40 hours when running against a Hadoop secure cluster. spark-submit has --principal and --keytab, so Kerberos ticket renewal works fine according to the logs. Something happens with the HDFS connection? These messages come up every 1 second: See complete stack:

Spark Serializer VS Hadoop Serializer

2016-03-11 Thread Fei Hu
Hi, I am trying to migrate a program from Hadoop to Spark, but I have hit a problem with serialization. In the Hadoop program, the key and value classes implement org.apache.hadoop.io.WritableComparable, which handles the serialization. Now in the Spark program, I used newAPIHadoopRDD to
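Writable classes are not Java-serializable, so one common workaround (an assumption about this job, not the poster's code) is to convert them to plain types immediately after loading; the input format, key/value classes, and path are illustrative:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

    val hadoopConf = new Configuration()
    hadoopConf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///data/input")  // hypothetical path

    val rdd = sc.newAPIHadoopRDD(
        hadoopConf,
        classOf[SequenceFileInputFormat[LongWritable, Text]],
        classOf[LongWritable],
        classOf[Text])
      // Copy out of the reused Writables; safe to shuffle or cache from here on.
      .map { case (k, v) => (k.get, v.toString) }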

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Vasu Parameswaran
Added these to the pom and still the same error :-(. I will look into sbt as well. On Fri, Mar 11, 2016 at 2:31 PM, Tristan Nixon wrote: > You must be relying on IntelliJ to compile your scala, because you haven’t > set up any scala plugin to compile it from maven. >

Re: Repeating Records w/ Spark + Avro?

2016-03-11 Thread Peyman Mohajerian
Here is the reason for the behavior: '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to
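A sketch of one common fix, assuming rdd is the (AvroKey[GenericRecord], NullWritable) pair RDD from AvroKeyInputFormat: deep-copy each datum so cached or shuffled elements stop sharing one reused object.

    import org.apache.avro.generic.{GenericData, GenericRecord}

    val records = rdd.map { case (avroKey, _) =>
      val datum = avroKey.datum()
      // deepCopy gives every element its own record instead of a shared reference
      GenericData.get().deepCopy(datum.getSchema, datum)
    }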

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Tristan Nixon
Right, well I don’t think the issue is with how you’re compiling the scala. I think it’s a conflict between different versions of several libs. I had similar issues with my spark modules. You need to make sure you’re not loading a different version of the same lib that is clobbering another

Re: Spark Streaming: java.lang.NoClassDefFoundError: org/apache/kafka/common/message/KafkaLZ4BlockOutputStream

2016-03-11 Thread Ted Yu
KafkaLZ4BlockOutputStream is in kafka-clients jar : $ jar tvf kafka-clients-0.8.2.0.jar | grep KafkaLZ4BlockOutputStream 1609 Wed Jan 28 22:30:36 PST 2015 org/apache/kafka/common/message/KafkaLZ4BlockOutputStream$BD.class 2918 Wed Jan 28 22:30:36 PST 2015

Re: Spark Streaming: java.lang.NoClassDefFoundError: org/apache/kafka/common/message/KafkaLZ4BlockOutputStream

2016-03-11 Thread Pankaj Wahane
Next thing you may want to check is if the jar has been provided to all the executors in your cluster. Most of the class not found errors got resolved for me after making required jars available in the SparkContext. Thanks. From: Ted Yu > Date:

Re: Spark with Yarn Client

2016-03-11 Thread Alexander Pivovarov
Check the doc: http://spark.apache.org/docs/latest/running-on-yarn.html Also, you can start an EMR-4.2.0 or 4.3.0 cluster with a Spark app and see how it's configured. On Fri, Mar 11, 2016 at 7:50 PM, Divya Gehlot wrote: > Hi, > I am trying to understand behaviour /configuration

Re: ALS update without re-computing everything

2016-03-11 Thread Nick Pentreath
Currently this is not supported. If you want to do incremental fold-in of new data you would need to do it outside of Spark (e.g. see this discussion: https://mail-archives.apache.org/mod_mbox/spark-user/201603.mbox/browser, which also mentions a streaming on-line MF implementation with SGD). In

RE: Spark on YARN memory consumption

2016-03-11 Thread Silvio Fiorito
Hi Jan, Yes what you’re seeing is due to YARN container memory overhead. Also, typically the memory increments for YARN containers is 1GB. This gives a good overview: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ Thanks, Silvio From: Jan

Re: Spark on YARN memory consumption

2016-03-11 Thread Jan Štěrba
Thanks, that explains a lot. -- Jan Sterba https://twitter.com/honzasterba | http://flickr.com/honzasterba | http://500px.com/honzasterba On Fri, Mar 11, 2016 at 2:36 PM, Silvio Fiorito wrote: > Hi Jan, > > > > Yes what you’re seeing is due to YARN container memory

RE: Graphx

2016-03-11 Thread John Lilley
Andrew, We conducted some tests for using Graphx to solve the connected-components problem and were disappointed. On 8 nodes of 16GB each, we could not get above 100M edges. On 8 nodes of 60GB each, we could not process 1bn edges. RDD serialization would take excessive time and then we

udf StructField to JSON String

2016-03-11 Thread Caires Vinicius
I have one DataFrame with a nested StructField and I want to convert it to a JSON String. Is there any way to accomplish this?

Spark on YARN memory consumption

2016-03-11 Thread Jan Štěrba
Hello, I am experimenting with tuning an on-demand Spark cluster on top of our Cloudera Hadoop. I am running Cloudera 5.5.2 with Spark 1.5 right now and I am running Spark in yarn-client mode. Right now my main experimentation is about the spark.executor.memory property and I have noticed a strange

Re: ALS update without re-computing everything

2016-03-11 Thread Sean Owen
On Fri, Mar 11, 2016 at 12:18 PM, Nick Pentreath wrote: > In general, for serving situations MF models are stored in some other > serving system, so that system may be better suited to do the actual > fold-in. Sean's Oryx project does that, though I'm not sure offhand if

BlockFetchFailed Exception

2016-03-11 Thread Priya Ch
Hi All, I am trying to run Spark k-means on a data set which is close to 1 GB. Most often I see a BlockFetchFailed exception, which I suspect is caused by running out of memory. Here are the configuration details: total cores: 12, total workers: 3, memory per node: 6GB. When running the job, I am giving

Re: Can we use spark inside a web service?

2016-03-11 Thread Andrés Ivaldi
Nice discussion. I have a question about web services with Spark. What could be the problem with using Akka HTTP as the web service layer (like Play does), with one SparkContext created and the queries over Akka HTTP using only that SparkContext instance? Also, about analytics, we are working on

Doubt on data frame

2016-03-11 Thread ram kumar
Hi, I registered a dataframe as a table using registerTempTable and I didn't close the Spark context. Will the table be available for a longer time? Thanks

spark-submit with cluster deploy mode fails with ClassNotFoundException (jars are not passed around properley?)

2016-03-11 Thread Hiroyuki Yamada
Hi, I am trying to work with spark-submit in cluster deploy mode on a single node, but I keep getting a ClassNotFoundException as shown below. (In this case, snakeyaml.jar is not found from the spark cluster.) === 16/03/12 14:19:12 INFO Remoting: Starting remoting 16/03/12 14:19:12 INFO Remoting:

Re: NullPointerException

2016-03-11 Thread Prabhu Joseph
Looking at ExternalSorter.scala line 192, I suspect some input record has a null key. 189 while (records.hasNext) { 190 addElementsRead() 191 kv = records.next() 192 map.changeValue((getPartition(kv._1), kv._1), update) On Sat, Mar 12, 2016 at 12:48 PM, Prabhu Joseph
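If a null key is indeed the cause, a defensive check before the shuffle stage avoids the NPE; pairs stands in for whatever pair RDD feeds this stage:

    // Confirm the suspicion first, then drop the offending records.
    val nullKeyCount = pairs.filter { case (k, _) => k == null }.count()
    val safePairs = pairs.filter { case (k, _) => k != null }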

Re: NullPointerException

2016-03-11 Thread Ted Yu
Which Spark release do you use ? I wonder if the following may have fixed the problem: SPARK-8029 Robust shuffle writer JIRA is down, cannot check now. On Fri, Mar 11, 2016 at 11:01 PM, Saurabh Guru wrote: > I am seeing the following exception in my Spark Cluster every

Re: Correct way to use spark streaming with apache zeppelin

2016-03-11 Thread Mich Talebzadeh
Hi, I use Zeppelin as well and in the notebook mode you can do analytics much like what you do in Spark-shell. You can store your intermediate data in Parquet if you wish and then analyse data the way you like. What is your use case here? Zeppelin as I use it is a web UI to your spark-shell,

Re: NullPointerException

2016-03-11 Thread Prabhu Joseph
Looking at ExternalSorter.scala line 192: 189 while (records.hasNext) { addElementsRead() kv = records.next() map.changeValue((getPartition(kv._1), kv._1), update) maybeSpillCollection(usingMap = true) } On Sat, Mar 12, 2016 at 12:31 PM, Saurabh Guru wrote: > I am seeing

Re: Spark Streaming: java.lang.NoClassDefFoundError: org/apache/kafka/common/message/KafkaLZ4BlockOutputStream

2016-03-11 Thread Siva
Thanks a lot Ted and Pankaj for your response. Changing the class path with correct version of kafka jars resolved the issue. Thanks, Sivakumar Bhavanari. On Fri, Mar 11, 2016 at 5:59 PM, Pankaj Wahane wrote: > Next thing you may want to check is if the jar has been

Correct way to use spark streaming with apache zeppelin

2016-03-11 Thread trung kien
Hi all, I've just viewed some Zeppelin videos. The integration between Zeppelin and Spark is really amazing and I want to use it for my application. In my app, I will have a Spark Streaming app to do some basic realtime aggregation (intermediate data). Then I want to use Zeppelin to do

NullPointerException

2016-03-11 Thread Saurabh Guru
I am seeing the following exception in my Spark Cluster every few days in production. 2016-03-12 05:30:00,541 - WARN TaskSetManager - Lost task 0.0 in stage 12528.0 (TID 18792, ip-1X-1XX-1-1XX.us -west-1.compute.internal ): java.lang.NullPointerException at

Re: NullPointerException

2016-03-11 Thread Saurabh Guru
I am using the following versions: org.apache.spark:spark-streaming_2.10:1.6.0 and org.apache.spark:spark-streaming-kafka_2.10

Does Spark support in-memory shuffling?

2016-03-11 Thread Xudong Zheng
Hi all, Does Spark support in-memory shuffling now? If not, is there any consideration for it? Thanks! -- Xudong Zheng

Re: Doubt on data frame

2016-03-11 Thread Ted Yu
temporary tables are associated with SessionState which is used by SQLContext. Did you keep the session ? Cheers On Fri, Mar 11, 2016 at 5:02 AM, ram kumar wrote: > Hi, > > I registered a dataframe as a table using registerTempTable > and I didn't close the Spark
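A tiny sketch of the lifetime rule Ted describes; df is any DataFrame and sc the SparkContext, both assumed to be in scope:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    df.registerTempTable("my_table")
    sqlContext.sql("SELECT COUNT(*) FROM my_table").show()  // same SQLContext: works
    // A different SQLContext (or a restarted application) will not see "my_table".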

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Cody Koeninger
Can you post your actual code? On Thu, Mar 10, 2016 at 9:55 PM, Mukul Gupta wrote: > Hi All, I was running the following test: Setup 9 VM runing spark workers > with 1 spark executor each. 1 VM running kafka and spark master. Spark > version is 1.6.0 Kafka version is

RE: Graphx

2016-03-11 Thread John Lilley
A colleague did the experiments and I don't know exactly how he observed that. I think it was indirect: from the Spark diagnostics indicating the amount of I/O, he deduced that this was RDD serialization. Also, when he added light compression to RDD serialization this improved matters. John

Re: Graphx

2016-03-11 Thread lihu
Hi, John: I am very interested in your experiment. How did you determine that RDD serialization cost a lot of time - from the logs or some other tools? On Fri, Mar 11, 2016 at 8:46 PM, John Lilley wrote: > Andrew, > > > > We conducted some tests for using Graphx to solve

Re: Graphx

2016-03-11 Thread Ovidiu-Cristian MARCU
Hi, I wonder what version of Spark and different parameter configuration you used. I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations) using 16 nodes with around 80GB RAM each (Spark 1.5, default parameters) John: I suppose your C++ app (algorithm) does not scale if you used

Re: udf StructField to JSON String

2016-03-11 Thread Caires Vinicius
Hmm. I think my problem is a little more complex. I'm using https://github.com/databricks/spark-redshift and when I read from a JSON file I get this schema: root |-- app: string (nullable = true) |-- ct: long (nullable = true) |-- event: struct (nullable = true) | |-- attributes: struct

spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
I've run into a situation where it would appear that foreachPartition is only running on one of my executors. I have a small cluster (2 executors with 8 cores each). When I run a job with a small file (with 16 partitions) I can see that the 16 partitions are initialized but they all appear to
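One thing worth ruling out (an assumption, not a diagnosis from this thread) is that the small file's partitions all landed on one node; forcing a shuffle spreads them, and logging the host shows where each partition actually runs:

    val spread = rdd.repartition(16)  // force a shuffle across executors
    spread.foreachPartition { iter =>
      val host = java.net.InetAddress.getLocalHost.getHostName
      // note: size consumes the iterator; fine here because we only log
      println(s"partition ran on $host with ${iter.size} records")
    }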

Re: Python unit tests - Unable to ru it with Python 2.6 or 2.7

2016-03-11 Thread Gayathri Murali
I do have 2.7 installed and unittest2 package available. I still see this error : Please install unittest2 to test with Python 2.6 or earlier Had test failures in pyspark.sql.tests with python2.6; see logs. Thanks Gayathri On Fri, Mar 11, 2016 at 10:07 AM, Davies Liu

Re: Is there Graph Partitioning impl for Scala/Spark?

2016-03-11 Thread Alexander Pivovarov
The JUNG library has 4 Community Detection (Community Structure) algorithms implemented, including the Girvan-Newman algorithm (EdgeBetweennessClusterer.java) https://github.com/jrtom/jung/tree/master/jung-algorithms/src/main/java/edu/uci/ics/jung/algorithms/cluster Girvan-Newman algorithm paper

RE: Graphx

2016-03-11 Thread John Lilley
Thanks Alexander, this is really good information. However it reinforces that we cannot use GraphX, because our customers typically have on-prem clusters in the 10-node range. Very few have the kind of horsepower you are talking about. We can’t just tell them to quadruple their cluster size

RE: Graphx

2016-03-11 Thread John Lilley
We have almost zero node info – just an identifying integer. John Lilley From: Alexis Roos [mailto:alexis.r...@gmail.com] Sent: Friday, March 11, 2016 11:24 AM To: Alexander Pivovarov Cc: John Lilley ; Ovidiu-Cristian MARCU

ALS update without re-computing everything

2016-03-11 Thread Roberto Pagliari
In the current implementation of ALS with implicit feedback, when new data comes in, it is not possible to update the user/product matrices without re-computing everything. Is this feature planned, or is there any known workaround? Thank you,

Re: ALS update without re-computing everything

2016-03-11 Thread Nick Pentreath
There is a general movement to allowing initial models to be specified for Spark ML algorithms, so I'll add a JIRA to that task set. I should be able to work on this as well as other ALS improvements. Oh, another reason fold-in is typically not done in Spark is that for models of any reasonable

Re: How can I join two DataSet of same case class?

2016-03-11 Thread Jacek Laskowski
Hi, Use the names of the datasets, not $, i.e. a("edid"). Jacek 11.03.2016 6:09 AM "박주형" wrote: > Hi. I want to join two DataSets, but the stderr below is shown > > 16/03/11 13:55:51 WARN ColumnName: Constructing trivially true equals > predicate, ''edid = 'edid'. Perhaps

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Nick Pentreath
Ok, I think I understand things better now. For Spark's current implementation, you would need to map those features as you mention. You could also use say StringIndexer -> OneHotEncoder or VectorIndexer. You could create a Pipeline to deal with the mapping and training (e.g.
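A sketch of the pipeline Nick describes, with illustrative column names ("category", "label") and a trainingDf assumed to exist:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    val indexer = new StringIndexer()
      .setInputCol("category").setOutputCol("categoryIndex")
    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex").setOutputCol("categoryVec")
    val lr = new LogisticRegression()
      .setFeaturesCol("categoryVec").setLabelCol("label")

    // One pipeline handles the mapping and the training together.
    val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))
    val model = pipeline.fit(trainingDf)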

Re: udf StructField to JSON String

2016-03-11 Thread Tristan Nixon
I have a similar situation in an app of mine. I implemented a custom ML Transformer that wraps the Jackson ObjectMapper - this gives you full control over how your custom entities / structs are serialized. > On Mar 11, 2016, at 11:53 AM, Caires Vinicius wrote: > > Hmm. I

Re: Python unit tests - Unable to ru it with Python 2.6 or 2.7

2016-03-11 Thread Davies Liu
Spark 2.0 is dropping support for Python 2.6; it will only work with Python 2.7 and 3.4+. On Thu, Mar 10, 2016 at 11:17 PM, Gayathri Murali wrote: > Hi all, > > I am trying to run python unit tests. > > I currently have Python 2.6 and 2.7 installed. I installed

Re: Graphx

2016-03-11 Thread Alexander Pivovarov
We use it in prod: 70 boxes, 61GB RAM each. GraphX Connected Components works fine on 250M vertices and 1B edges (takes about 5-10 min). Spark likes memory, so use r3.2xlarge boxes (61GB). For example 10 x r3.2xlarge (61GB) work much faster than 20 x r3.xlarge (30.5 GB) (especially if you have

Re: Graphx

2016-03-11 Thread Alexis Roos
Also we keep the Node info minimal as needed for connected components and rejoin later. Alexis On Fri, Mar 11, 2016 at 10:12 AM, Alexander Pivovarov wrote: > we use it in prod > > 70 boxes, 61GB RAM each > > GraphX Connected Components works fine on 250M Vertices and 1B

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Jacek Laskowski
Hi, How do you check which executor is used? Can you include a screenshot of the master's webUI with workers? Jacek 11.03.2016 6:57 PM "Darin McBeath" wrote: > I've run into a situation where it would appear that foreachPartition is > only running on one of my

Re: Python unit tests - Unable to ru it with Python 2.6 or 2.7

2016-03-11 Thread Holden Karau
So the run-tests command allows you to specify the python version to test against - maybe specify python2.7? On Friday, March 11, 2016, Gayathri Murali wrote: > I do have 2.7 installed and unittest2 package available. I still see this > error : > > Please install

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
I can verify this by looking at the log file for the workers. Since I output logging statements in the object called by the foreachPartition, I can see the statements being logged. Oddly, these output statements only occur in one executor (and not the other). They occur 16 times in this

RE: Graphx

2016-03-11 Thread John Lilley
PS: This is the code I use to generate clustered test data: public class GenCCInput { public static void main(String[] args) { if (args.length != 2) { System.err.println("Usage: \njava GenCCInput "); System.exit(-1); } int edges = Integer.parseInt(args[0]); int

RE: Graphx

2016-03-11 Thread John Lilley
I suppose for a 2.6bn case we’d need Long: public class GenCCInput { public static void main(String[] args) { if (args.length != 2) { System.err.println("Usage: \njava GenCCInput "); System.exit(-1); } long edges = Long.parseLong(args[0]); long groupSize =

Re: How can I join two DataSet of same case class?

2016-03-11 Thread Xinh Huynh
I think you have to use an alias. To provide an alias to a Dataset: val d1 = a.as("d1") val d2 = b.as("d2") Then join, using the alias in the column names: d1.joinWith(d2, $"d1.edid" === $"d2.edid") Finally, please double-check your column names. I did not see "edid" in your case class. Xinh

Re: Spark configuration with 5 nodes

2016-03-11 Thread Mich Talebzadeh
Hi Steve, My argument has always been that if one is going to use Solid State Disks (SSDs), it makes sense to have them for the NameNode disks, for start-up from fsimage etc. Obviously the NN lives in memory. Would you also recommend RAID10 (mirroring & striping) for NN disks? Thanks Dr Mich Talebzadeh

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Daniel Siegmann
On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath wrote: > Would you mind letting us know the # training examples in the datasets? > Also, what do your features look like? Are they text, categorical etc? You > mention that most rows only have a few features, and all rows

Re: Doubt on data frame

2016-03-11 Thread ram kumar
No, I am not aware of it. Can you provide me with the details regarding this? Thanks On Fri, Mar 11, 2016 at 8:25 PM, Ted Yu wrote: > temporary tables are associated with SessionState which is used > by SQLContext. > > Did you keep the session ? > > Cheers > > On Fri, Mar

RE: Graphx

2016-03-11 Thread John Lilley
Ovidiu, IMHO, this is one of the biggest issues facing GraphX and Spark. There are a lot of knobs and levers to pull to affect performance, with very little guidance about which settings work in general. We cannot ship software that requires end-user tuning; it just has to work.

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Jacek Laskowski
Hi, Doh! My eyes are bleeding to go through XMLs... Where did you specify the Scala version? Dunno how it's done in Maven. p.s. I *strongly* recommend sbt. Jacek 11.03.2016 8:04 PM "Vasu Parameswaran" wrote: > Thanks Jacek. Pom is below (Currently set to 1.6.1 spark but I

Re: How to distribute dependent files (.so , jar ) across spark worker nodes

2016-03-11 Thread Jacek Laskowski
Hi, For jars use spark-submit --jars. Dunno about .so's - could those work through jars? Jacek 11.03.2016 8:07 PM "prateek arora" wrote: > Hi > > I have a multi-node cluster and my spark jobs depend on a native > library (.so files) and some jar files. > > Can

sliding Top N window

2016-03-11 Thread Yakubovich, Alexey
Good day, I have the following task: a stream of "page views" coming into a Kafka topic. Each view contains a list of product IDs from a visited page. The task: to have the Top N products in "real time". I am interested in a solution that would require minimum intermediate writes … So I need to build a
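A sketch of one low-write approach with Spark Streaming's windowing, assuming views is a DStream[(String, Long)] of (productId, 1L) pairs already parsed from the Kafka topic and that checkpointing is enabled (the inverse-reduce form requires it); the window sizes are illustrative:

    import org.apache.spark.streaming.{Minutes, Seconds}

    // Incrementally maintained sliding counts: add new batches, subtract expired ones.
    val counts = views.reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(30))

    // Top 10 products per slide, computed in memory without intermediate writes.
    val topN = counts.transform { rdd =>
      rdd.sortBy({ case (_, c) => c }, ascending = false)
         .zipWithIndex()
         .collect { case ((product, c), rank) if rank < 10 => (product, c) }
    }
    topN.print()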

Re: udf StructField to JSON String

2016-03-11 Thread Michael Armbrust
df.select("event").toJSON On Fri, Mar 11, 2016 at 9:53 AM, Caires Vinicius wrote: > Hmm. I think my problem is a little more complex. I'm using > https://github.com/databricks/spark-redshift and when I read from JSON > file I got this schema. > > root > > |-- app: string

Re: adding rows to a DataFrame

2016-03-11 Thread Michael Armbrust
Or look at explode on DataFrame On Fri, Mar 11, 2016 at 10:45 AM, Stefan Panayotov wrote: > Hi, > > I have a problem that requires me to go through the rows in a DataFrame > (or possibly through rows in a JSON file) and conditionally add rows > depending on a value in one of

Re: udf StructField to JSON String

2016-03-11 Thread Caires Vinicius
I would like to see the code as well Tristan! On Fri, Mar 11, 2016 at 1:53 PM Tristan Nixon wrote: > Have you looked at DataFrame.write.json( path )? > > https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter > > > On Mar 11, 2016,

Re: Spark property parameters priority

2016-03-11 Thread Jacek Laskowski
Hi, It could also be conf/spark-defaults.conf. Jacek 11.03.2016 8:07 PM "Cesar Flores" wrote: > > Right now I know of three different ways to pass property parameters to > the Spark Context. They are: > > - A) Inside a SparkConf object just before creating the Spark

Re: udf StructField to JSON String

2016-03-11 Thread Tristan Nixon
Have you looked at DataFrame.write.json( path )? https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter > On Mar 11, 2016, at 7:15 AM, Caires Vinicius wrote: > > I have one DataFrame with nested StructField and I want to convert to

Re: Doubt on data frame

2016-03-11 Thread Mich Talebzadeh
Temporary tables are created in temp file space within the session. Once the session is closed, the temporary table goes. scala> rs.registerTempTable("mytemp") And this is the temporary file created with the above command: drwx------ - hduser supergroup 0 2016-03-11 17:09

Re: Get output of the ALS algorithm.

2016-03-11 Thread Jacek Laskowski
What about write.save(file)? P.s. I'm new to Spark MLlib. 11.03.2016 4:57 AM "Shishir Anshuman" wrote: > hello, > > I am new to Apache Spark and would like to get the Recommendation output > of the ALS algorithm in a file. > Please suggest me the solution. > >
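Two common options, assuming the mllib ALS MatrixFactorizationModel and illustrative output paths: persist the model itself, or write recommendations out as text.

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    // Persist the factor matrices; reload later with MatrixFactorizationModel.load(sc, path).
    model.save(sc, "hdfs:///models/als")

    // Or materialize top-10 recommendations per user into files.
    model.recommendProductsForUsers(10).saveAsTextFile("hdfs:///output/recommendations")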

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Mukul Gupta
Please note that while building the jar of the code below, I used Spark 1.6.0 + Kafka 0.9.0.0 libraries. I also tried the Spark 1.5.0 + Kafka 0.9.0.1 combination, but encountered the same issue. I could not use the ideal combination Spark 1.6.0 + Kafka 0.9.0.1 (which matches with the spark and kafka versions

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Cody Koeninger
Why are you including a specific dependency on Kafka? Spark's external streaming kafka module already depends on kafka. Can you link to an actual repo with build file etc? On Fri, Mar 11, 2016 at 11:21 AM, Mukul Gupta wrote: > Please note that while building jar of

Re: Does Spark support in-memory shuffling?

2016-03-11 Thread Ted Yu
Please take a look at SPARK-3376 and discussion on https://github.com/apache/spark/pull/5403 FYI On Fri, Mar 11, 2016 at 6:37 AM, Xudong Zheng wrote: > Hi all, > > Does Spark support in-memory shuffling now? If not, is there any > consideration for it? > > Thanks! > > -- >

Re: Get output of the ALS algorithm.

2016-03-11 Thread Bryan Cutler
Are you trying to save predictions on a dataset to a file, or the model produced after training with ALS? On Thu, Mar 10, 2016 at 7:57 PM, Shishir Anshuman wrote: > hello, > > I am new to Apache Spark and would like to get the Recommendation output > of the ALS
