Re: NullPointerException

2016-03-11 Thread Saurabh Guru
I am using the following versions: org.apache.spark:spark-streaming_2.10:1.6.0 and org.apache.spark:spark-streaming-kafka_2.10

Re: NullPointerException

2016-03-11 Thread Ted Yu
Which Spark release do you use? I wonder if the following may have fixed the problem: SPARK-8029 (Robust shuffle writer). JIRA is down, cannot check now. On Fri, Mar 11, 2016 at 11:01 PM, Saurabh Guru wrote: > I am seeing the following exception in my Spark Cluster every

Re: NullPointerException

2016-03-11 Thread Prabhu Joseph
Looking at ExternalSorter.scala line 192, I suspect some input record has a null key. 189 while (records.hasNext) { 190 addElementsRead() 191 kv = records.next() 192 map.changeValue((getPartition(kv._1), kv._1), update) On Sat, Mar 12, 2016 at 12:48 PM, Prabhu Joseph
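A minimal sketch of the defensive fix Prabhu is hinting at, assuming the job builds a key-value RDD before the shuffle (the input path, format and app name below are hypothetical): filter out null keys before any shuffle reaches ExternalSorter.

    import org.apache.spark.{SparkConf, SparkContext}

    object NullKeyFilter {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("null-key-check"))
        // hypothetical input: lines of "key<TAB>value"; malformed lines produce a null key
        val pairs = sc.textFile("/path/to/input").map { line =>
          val cols = line.split("\t", -1)
          (if (cols(0).isEmpty) null else cols(0), cols.lift(1).getOrElse(""))
        }
        // drop records with a null key before any shuffle (e.g. reduceByKey) reaches ExternalSorter
        pairs.filter { case (k, _) => k != null }
          .reduceByKey(_ + _)
          .count()
        sc.stop()
      }
    }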

Re: Correct way to use spark streaming with apache zeppelin

2016-03-11 Thread Mich Talebzadeh
Hi, I use Zeppelin as well and in the notebook mode you can do analytics much like what you do in Spark-shell. You can store your intermediate data in Parquet if you wish and then analyse data the way you like. What is your use case here? Zeppelin as I use it is a web UI to your spark-shell,

Re: NullPointerException

2016-03-11 Thread Prabhu Joseph
Looking at ExternalSorter.scala line 192: 189 while (records.hasNext) { addElementsRead() kv = records.next() map.changeValue((getPartition(kv._1), kv._1), update) maybeSpillCollection(usingMap = true) } On Sat, Mar 12, 2016 at 12:31 PM, Saurabh Guru wrote: > I am seeing

Correct way to use spark streaming with apache zeppelin

2016-03-11 Thread trung kien
Hi all, I've just viewed some Zeppelin videos. The integration between Zeppelin and Spark is really amazing and I want to use it for my application. In my app, I will have a Spark streaming app to do some basic real-time aggregation (intermediate data). Then I want to use Zeppelin to do

NullPointerException

2016-03-11 Thread Saurabh Guru
I am seeing the following exception in my Spark Cluster every few days in production. 2016-03-12 05:30:00,541 - WARN TaskSetManager - Lost task 0.0 in stage 12528.0 (TID 18792, ip-1X-1XX-1-1XX.us-west-1.compute.internal): java.lang.NullPointerException at

Re: Spark Streaming: java.lang.NoClassDefFoundError: org/apache/kafka/common/message/KafkaLZ4BlockOutputStream

2016-03-11 Thread Siva
Thanks a lot Ted and Pankaj for your response. Changing the classpath to the correct version of the Kafka jars resolved the issue. Thanks, Sivakumar Bhavanari. On Fri, Mar 11, 2016 at 5:59 PM, Pankaj Wahane wrote: > Next thing you may want to check is if the jar has been

spark-submit with cluster deploy mode fails with ClassNotFoundException (jars are not passed around properley?)

2016-03-11 Thread Hiroyuki Yamada
Hi, I am trying to work with spark-submit in cluster deploy mode on a single node, but I keep getting ClassNotFoundException as shown below. (In this case, snakeyaml.jar is not found by the spark cluster.) === 16/03/12 14:19:12 INFO Remoting: Starting remoting 16/03/12 14:19:12 INFO Remoting:

Re: Repeating Records w/ Spark + Avro?

2016-03-11 Thread Peyman Mohajerian
Here is the reason for the behavior. Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to
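A hedged sketch of the usual workaround, assuming an existing SparkContext sc and a hypothetical input path: deep-copy each record so the re-used Writable is not shared by every element of the cached RDD.

    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable

    val raw = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](
      "/path/to/data.avro")
    // copy the datum out of the re-used wrapper before caching or aggregating
    val records = raw.map { case (wrapper, _) =>
      val rec = wrapper.datum()
      GenericData.get().deepCopy(rec.getSchema, rec)
    }
    records.cache()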

Re: Spark with Yarn Client

2016-03-11 Thread Alexander Pivovarov
Check doc - http://spark.apache.org/docs/latest/running-on-yarn.html also you can start EMR-4.2.0 or 4.3.0 cluster with Spark app and see how it's configured On Fri, Mar 11, 2016 at 7:50 PM, Divya Gehlot wrote: > Hi, > I am trying to understand behaviour /configuration

Spark with Yarn Client

2016-03-11 Thread Divya Gehlot
Hi, I am trying to understand the behaviour/configuration of Spark with YARN client on a Hadoop cluster. Can somebody help me, or point me to documents/blogs/books which give a deeper understanding of the above two? Thanks, Divya

Spark session dies in about 2 days: HDFS_DELEGATION_TOKEN token can't be found

2016-03-11 Thread Ruslan Dautkhanov
The Spark session dies out after ~40 hours when running against a Hadoop Secure cluster. spark-submit has --principal and --keytab, so Kerberos ticket renewal works fine according to the logs. Something happens with the HDFS dfs connection? These messages come up every 1 second: See complete stack:

Spark Serializer VS Hadoop Serializer

2016-03-11 Thread Fei Hu
Hi, I am trying to migrate the program from Hadoop to Spark, but I met a problem about the serialization. In the Hadoop program, the key and value classes implement org.apache.hadoop.io.WritableComparable, which are for the serialization. Now in the spark program, I used newAPIHadoopRDD to

Re: Spark Streaming: java.lang.NoClassDefFoundError: org/apache/kafka/common/message/KafkaLZ4BlockOutputStream

2016-03-11 Thread Pankaj Wahane
Next thing you may want to check is if the jar has been provided to all the executors in your cluster. Most of the class not found errors got resolved for me after making required jars available in the SparkContext. Thanks. From: Ted Yu > Date:
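A small sketch of that fix, with a hypothetical jar path; the same effect can be had on the command line with spark-submit --jars.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("kafka-consumer"))
    // ship the kafka-clients jar to every executor so classes like
    // KafkaLZ4BlockOutputStream resolve at runtime
    sc.addJar("/path/to/kafka-clients-0.8.2.0.jar")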

Re: Spark Streaming: java.lang.NoClassDefFoundError: org/apache/kafka/common/message/KafkaLZ4BlockOutputStream

2016-03-11 Thread Ted Yu
KafkaLZ4BlockOutputStream is in kafka-clients jar : $ jar tvf kafka-clients-0.8.2.0.jar | grep KafkaLZ4BlockOutputStream 1609 Wed Jan 28 22:30:36 PST 2015 org/apache/kafka/common/message/KafkaLZ4BlockOutputStream$BD.class 2918 Wed Jan 28 22:30:36 PST 2015

Spark Streaming: java.lang.NoClassDefFoundError: org/apache/kafka/common/message/KafkaLZ4BlockOutputStream

2016-03-11 Thread Siva
Hi Everyone, All of a sudden we are encountering the below error from one of the spark consumers. It used to work before without any issues. When I restart the consumer with the latest offsets, it works fine for some time (it executed a few batches) and then it fails again; this issue is intermittent.

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Tristan Nixon
Right, well I don’t think the issue is with how you’re compiling the scala. I think it’s a conflict between different versions of several libs. I had similar issues with my spark modules. You need to make sure you’re not loading a different version of the same lib that is clobbering another

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Vasu Parameswaran
Added these to the pom and still the same error :-(. I will look into sbt as well. On Fri, Mar 11, 2016 at 2:31 PM, Tristan Nixon wrote: > You must be relying on IntelliJ to compile your scala, because you haven’t > set up any scala plugin to compile it from maven. >

Re: udf StructField to JSON String

2016-03-11 Thread Tristan Nixon
So I think in your case you’d do something more like: val jsontrans = new JsonSerializationTransformer[StructType].setInputCol("event").setOutputCol("eventJSON") > On Mar 11, 2016, at 3:51 PM, Tristan Nixon wrote: > > val jsontrans = new >

Re: YARN process with Spark

2016-03-11 Thread Alexander Pivovarov
You need to set yarn.scheduler.minimum-allocation-mb=32, otherwise the Spark AM container will be running on a dedicated box instead of running together with an executor container on one of the boxes. For slaves I use the Amazon EC2 r3.2xlarge box (61GB / 8 cores) - cost ~$0.10 / hour (spot instance)

Repeating Records w/ Spark + Avro?

2016-03-11 Thread Chris Miller
I have a bit of a strange situation: import org.apache.avro.generic.{GenericData, GenericRecord} import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper, AvroKey} import org.apache.avro.mapreduce.AvroKeyInputFormat import org.apache.hadoop.io.{NullWritable, WritableUtils}

Re: YARN process with Spark

2016-03-11 Thread Mich Talebzadeh
Thanks Koert and Alexander. I think the YARN configuration parameters in yarn-site.xml are important. For those I have yarn.nodemanager.resource.memory-mb (amount of max physical memory, in MB, that can be allocated for YARN containers): 8192, and yarn.nodemanager.vmem-pmem-ratio (ratio

Re: YARN process with Spark

2016-03-11 Thread Alexander Pivovarov
Forgot to mention. To avoid unnecessary container termination add the following setting to yarn yarn.nodemanager.vmem-check-enabled = false

Re: YARN process with Spark

2016-03-11 Thread Alexander Pivovarov
YARN cores are virtual cores which are used just to calculate available resources, but usually memory is used to manage YARN resources (not cores). Spark executor memory should be ~90% of yarn.scheduler.maximum-allocation-mb (which should be the same as yarn.nodemanager.resource.memory-mb); ~10%

Re: How to distribute dependent files (.so , jar ) across spark worker nodes

2016-03-11 Thread Tristan Nixon
I recommend you package all your dependencies (jars, .so’s, etc.) into a single uber-jar and then submit that. It’s much more convenient than trying to manage including everything in the --jars arg of spark-submit. If you build with maven then the shade plugin will do this for you nicely:

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Tristan Nixon
You must be relying on IntelliJ to compile your scala, because you haven’t set up any scala plugin to compile it from maven. You should have something like this in your plugins: net.alchim31.maven : scala-maven-plugin, with an execution (id scala-compile-first, phase process-resources, goal compile)

Re: YARN process with Spark

2016-03-11 Thread Koert Kuipers
You get a spark executor per yarn container. The spark executor can have multiple cores, yes; this is configurable. So the number of partitions that can be processed in parallel is num-executors * executor-cores, and for processing a partition the available memory is executor-memory /
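A worked illustration of that rule with hypothetical numbers, expressed as the equivalent SparkConf settings:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.instances", "4")  // 4 executors (one per YARN container)
      .set("spark.executor.cores", "4")      // 4 cores each => up to 16 partitions in flight
      .set("spark.executor.memory", "8g")    // roughly 8g / 4 = 2g of heap per concurrent partition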

YARN process with Spark

2016-03-11 Thread Mich Talebzadeh
Hi, Can these be clarified please: 1. Can a YARN container use more than one core, and is this configurable? 2. A YARN container is constrained to 8GB by "yarn.scheduler.maximum-allocation-mb". If a YARN container is a Spark process, will that limit also include the memory Spark

Re: udf StructField to JSON String

2016-03-11 Thread Tristan Nixon
It’s pretty simple, really: import com.fasterxml.jackson.databind.ObjectMapper import org.apache.spark.ml.UnaryTransformer import org.apache.spark.ml.util.Identifiable import org.apache.spark.sql.types.{DataType, StringType} /** * A SparkML Transformer that will transform an * entity of type T
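From those imports, a hedged reconstruction of what such a transformer might look like (a sketch consistent with Tristan's description, not his actual code):

    import com.fasterxml.jackson.databind.ObjectMapper
    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{DataType, StringType}

    // Serializes an input column of type T to a JSON string column using Jackson.
    class JsonSerializationTransformer[T](override val uid: String)
        extends UnaryTransformer[T, String, JsonSerializationTransformer[T]] {

      def this() = this(Identifiable.randomUID("jsonser"))

      // built lazily on each executor; ObjectMapper itself is not Java-serializable
      @transient private lazy val mapper = new ObjectMapper()

      override protected def createTransformFunc: T => String =
        entity => mapper.writeValueAsString(entity)

      override protected def outputDataType: DataType = StringType

      override def copy(extra: ParamMap): JsonSerializationTransformer[T] = defaultCopy(extra)
    }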

Re: adding rows to a DataFrame

2016-03-11 Thread Bijay Pathak
Here is another way you can achieve that (in Python): base_df.withColumn("column_name", "column_expression_for_new_column") # To add a new row, create the data frame containing the new row and do the unionAll() base_df.unionAll(new_df) # Another approach: convert to rdd, add required fields and convert

Re: adding rows to a DataFrame

2016-03-11 Thread Jan Štěrba
It very much depends on the logic that generates the new rows. If it is per row (i.e. without context), then you can just convert to RDD and perform a map operation on each row. JavaPairRDD grouped = dataFrame.javaRDD().groupBy( group by what you need, probably ID ); return

sliding Top N window

2016-03-11 Thread Yakubovich, Alexey
Good day, I have the following task: a stream of “page views” coming into a Kafka topic. Each view contains a list of product ids from a visited page. The task: to have in “real time” the Top N products. I am interested in some solution that would require minimum intermediate writes … So need to build a
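One hedged way to sketch this with plain Spark Streaming (broker address, topic, parsing and window sizes below are all hypothetical placeholders): keep a windowed count per product id and take the Top N on each batch.

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("/tmp/topn-checkpoint")              // needed for the windowed reduce below
    val views = KafkaUtils.createStream(ssc, "zkhost:2181", "topn-group", Map("page_views" -> 1))
      .flatMap { case (_, page) => page.split(",") }    // assume a comma-separated list of product ids
    val counts = views.map(id => (id, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(300), Seconds(10))
    counts.foreachRDD { rdd =>
      rdd.top(10)(Ordering.by(_._2)).foreach(println)   // current Top 10 products in the window
    }
    ssc.start()
    ssc.awaitTermination()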

Re: How to distribute dependent files (.so , jar ) across spark worker nodes

2016-03-11 Thread Jacek Laskowski
Hi, For jars use spark-submit --jars. Dunno about .so's. Could that work through jars? Jacek 11.03.2016 8:07 PM "prateek arora" wrote: > Hi > > I have multiple node cluster and my spark jobs depend on a native > library (.so files) and some jar files. > > Can

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Jacek Laskowski
Hi, Doh! My eyes are bleeding from going through XMLs... Where did you specify the Scala version? Dunno how it's done in maven. p.s. I *strongly* recommend sbt. Jacek 11.03.2016 8:04 PM "Vasu Parameswaran" wrote: > Thanks Jacek. Pom is below (Currently set to 1.6.1 spark but I

Re: Spark property parameters priority

2016-03-11 Thread Jacek Laskowski
Hi, It could also be conf/spark-defaults.conf. Jacek 11.03.2016 8:07 PM "Cesar Flores" wrote: > > Right now I know of three different things to pass property parameters to > the Spark Context. They are: > >- A) Inside a SparkConf object just before creating the Spark

Re: udf StructField to JSON String

2016-03-11 Thread Caires Vinicius
I would like to see the code as well Tristan! On Fri, Mar 11, 2016 at 1:53 PM Tristan Nixon wrote: > Have you looked at DataFrame.write.json( path )? > > https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter > > > On Mar 11, 2016,

Re: adding rows to a DataFrame

2016-03-11 Thread Michael Armbrust
Or look at explode on DataFrame On Fri, Mar 11, 2016 at 10:45 AM, Stefan Panayotov wrote: > Hi, > > I have a problem that requires me to go through the rows in a DataFrame > (or possibly through rows in a JSON file) and conditionally add rows > depending on a value in one of

Re: udf StructField to JSON String

2016-03-11 Thread Michael Armbrust
df.select("event").toJSON On Fri, Mar 11, 2016 at 9:53 AM, Caires Vinicius wrote: > Hmm. I think my problem is a little more complex. I'm using > https://github.com/databricks/spark-redshift and when I read from JSON > file I got this schema. > > root > > |-- app: string

Re: Python unit tests - Unable to ru it with Python 2.6 or 2.7

2016-03-11 Thread Gayathri Murali
Thanks Josh. I am able to run with Python 2.7 explicitly by specifying --python-executables=python2.7. By default it checks only for Python2.6. Thanks Gayathri On Fri, Mar 11, 2016 at 10:35 AM, Josh Rosen wrote: > AFAIK we haven't actually broken 2.6 compatibility

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
OK, some more information (and presumably a workaround). When I initially read in my file, I use the following code: JavaRDD keyFileRDD = sc.textFile(keyFile) Looking at the UI, this file has 2 partitions (both on the same executor). I then subsequently repartition this RDD (to 16)
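A short sketch of that workaround (in Scala for brevity; path and partition count are hypothetical): either ask textFile for more partitions up front, or repartition before the foreachPartition call so the work spreads across executors.

    val keyFileRDD = sc.textFile("/path/to/keyfile", 16)   // request 16 partitions at read time
    val spread = keyFileRDD.repartition(16)                 // or force a shuffle to spread the data
    spread.foreachPartition { iter =>
      // per-partition initialization (e.g. an S3 client) goes here, then consume the iterator
      iter.foreach(_ => ())
    }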

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Daniel Siegmann
Thanks for the pointer to those indexers, those are some good examples. A good way to go for the trainer and any scoring done in Spark. I will definitely have to deal with scoring in non-Spark systems though. I think I will need to scale up beyond what single-node liblinear can practically

How to distribute dependent files (.so , jar ) across spark worker nodes

2016-03-11 Thread prateek arora
Hi, I have a multi-node cluster and my spark jobs depend on a native library (.so files) and some jar files. Can someone please explain the best ways to distribute dependent files across nodes? Right now I copy the dependent files to all nodes using the Chef tool. Regards Prateek --

Spark property parameters priority

2016-03-11 Thread Cesar Flores
Right now I know of three different ways to pass property parameters to the Spark Context. They are: - A) Inside a SparkConf object just before creating the Spark Context - B) During job submission (i.e. --conf spark.driver.memory=2g) - C) By using a specific property file

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Vasu Parameswaran
Thanks Jacek. Pom is below (Currently set to 1.6.1 spark but I started out with 1.6.0 with the same problem). <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
My driver code has the following: // Init S3 (workers) so we can read the assets partKeyFileRDD.foreachPartition(new SimpleStorageServiceInit(arg1, arg2, arg3)); // Get the assets. Create a key pair where the key is asset id and the value is the rec. JavaPairRDD seqFileRDD =

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Vasu Parameswaran
Thanks Ted. I haven't explicitly specified Scala (I tried different versions in pom.xml as well). For what it is worth, this is what I get when I do a maven dependency tree. I wonder if the 2.11.2 coming from scala-reflect matters: [INFO] | | \- org.scala-lang:scalap:jar:2.11.0:compile

Re: adding rows to a DataFrame

2016-03-11 Thread Jacek Laskowski
Just a guess... flatMap? Jacek 11.03.2016 7:46 PM "Stefan Panayotov" wrote: > Hi, > > I have a problem that requires me to go through the rows in a DataFrame > (or possibly through rows in a JSON file) and conditionally add rows > depending on a value in one of the
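A hedged sketch of the flatMap idea (column positions and the condition are hypothetical): emit each original row plus an extra derived row when the condition holds, then rebuild the DataFrame.

    import org.apache.spark.sql.Row

    val expanded = df.rdd.flatMap { row =>
      if (row.getInt(1) > 100)
        Seq(row, Row(row.get(0), 0, row.get(2)))   // add one synthetic row next to the original
      else
        Seq(row)
    }
    val withExtraRows = sqlContext.createDataFrame(expanded, df.schema)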

Re: udf StructField to JSON String

2016-03-11 Thread Jacek Laskowski
Hi Tristan, Mind sharing the relevant code? I'd like to learn the way you use a Transformer to do so. Thanks! Jacek 11.03.2016 7:07 PM "Tristan Nixon" wrote: > I have a similar situation in an app of mine. I implemented a custom ML > Transformer that wraps the Jackson

Re: Graphx

2016-03-11 Thread Khaled Ammar
This is an interesting discussion; I have had some success running GraphX on large graphs with more than a billion edges using clusters of different sizes, up to 64 machines. However, the performance goes down when I double the cluster size to reach 128 machines of r3.xlarge. Does anyone have

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Jacek Laskowski
Hi, Why do you use maven and not sbt for Scala? Can you show the entire pom.xml and the command to execute the app? Jacek 11.03.2016 7:33 PM "vasu20" wrote: > Hi > > Any help appreciated on this. I am trying to write a Spark program using > IntelliJ. I get a run time

adding rows to a DataFrame

2016-03-11 Thread Stefan Panayotov
Hi, I have a problem that requires me to go through the rows in a DataFrame (or possibly through rows in a JSON file) and conditionally add rows depending on a value in one of the columns in each existing row. So, for example if I have: +---+---+---+ | _1| _2| _3| +---+---+---+

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Jacek Laskowski
Hi, Could you share the code with foreachPartition? Jacek 11.03.2016 7:33 PM "Darin McBeath" wrote: > > > I can verify this by looking at the log file for the workers. > > Since I output logging statements in the object called by the > foreachPartition, I can see the

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Ted Yu
Looks like Scala version mismatch. Are you using 2.11 everywhere ? On Fri, Mar 11, 2016 at 10:33 AM, vasu20 wrote: > Hi > > Any help appreciated on this. I am trying to write a Spark program using > IntelliJ. I get a run time error as soon as new SparkConf() is called from

hive.metastore.metadb.dir not working programmatically

2016-03-11 Thread harirajaram
Experts, need your help. I'm using spark 1.4.1, and when I set hive.metastore.metadb.dir programmatically for a HiveContext (i.e. for a local metastore, i.e. the default metastore_db for derby), the metastore_db is still getting created in the same path as user.dir. Can you guys provide some insights

Re: Python unit tests - Unable to ru it with Python 2.6 or 2.7

2016-03-11 Thread Josh Rosen
AFAIK we haven't actually broken 2.6 compatibility yet for PySpark itself, since Jenkins is still testing that configuration. I think the problem that you're seeing is that dev/run-tests / dev/run-tests-jenkins only work against Python 2.7+ right now. However, ./python/run-tests should be able to

Newbie question - Help with runtime error on augmentString

2016-03-11 Thread vasu20
Hi Any help appreciated on this. I am trying to write a Spark program using IntelliJ. I get a run time error as soon as new SparkConf() is called from main. Top few lines of the exception are pasted below. These are the following versions: Spark jar: spark-assembly-1.6.0-hadoop2.6.0.jar

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
I can verify this by looking at the log file for the workers. Since I output logging statements in the object called by the foreachPartition, I can see the statements being logged. Oddly, these output statements only occur in one executor (and not the other). It occurs 16 times in this

Re: Python unit tests - Unable to ru it with Python 2.6 or 2.7

2016-03-11 Thread Holden Karau
So the run-tests command allows you to specify the python version to test against - maybe specify python2.7 On Friday, March 11, 2016, Gayathri Murali wrote: > I do have 2.7 installed and unittest2 package available. I still see this > error : > > Please install

RE: Graphx

2016-03-11 Thread John Lilley
We have almost zero node info – just an identifying integer. John Lilley From: Alexis Roos [mailto:alexis.r...@gmail.com] Sent: Friday, March 11, 2016 11:24 AM To: Alexander Pivovarov Cc: John Lilley ; Ovidiu-Cristian MARCU

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Jacek Laskowski
Hi, How do you check which executor is used? Can you include a screenshot of the master's webUI with workers? Jacek 11.03.2016 6:57 PM "Darin McBeath" wrote: > I've run into a situation where it would appear that foreachPartition is > only running on one of my

Re: Graphx

2016-03-11 Thread Alexis Roos
Also we keep the Node info minimal as needed for connected components and rejoin later. Alexis On Fri, Mar 11, 2016 at 10:12 AM, Alexander Pivovarov wrote: > we use it in prod > > 70 boxes, 61GB RAM each > > GraphX Connected Components works fine on 250M Vertices and 1B

RE: Graphx

2016-03-11 Thread John Lilley
Thanks Alexander, this is really good information. However it reinforces that we cannot use GraphX, because our customers typically have on-prem clusters in the 10-node range. Very few have the kind of horsepower you are talking about. We can’t just tell them to quadruple their cluster size

Re: Is there Graph Partitioning impl for Scala/Spark?

2016-03-11 Thread Alexander Pivovarov
JUNG library has 4 Community Detection (Community Structure) algorithms implemented including Girvan–Newman algorithm (EdgeBetweennessClusterer.java) https://github.com/jrtom/jung/tree/master/jung-algorithms/src/main/java/edu/uci/ics/jung/algorithms/cluster Girvan–Newman algorithm paper

Re: Python unit tests - Unable to ru it with Python 2.6 or 2.7

2016-03-11 Thread Gayathri Murali
I do have 2.7 installed and unittest2 package available. I still see this error : Please install unittest2 to test with Python 2.6 or earlier Had test failures in pyspark.sql.tests with python2.6; see logs. Thanks Gayathri On Fri, Mar 11, 2016 at 10:07 AM, Davies Liu

Re: Graphx

2016-03-11 Thread Alexander Pivovarov
We use it in prod: 70 boxes, 61GB RAM each. GraphX Connected Components works fine on 250M vertices and 1B edges (takes about 5-10 min). Spark likes memory, so use r3.2xlarge boxes (61GB). For example 10 x r3.2xlarge (61GB) work much faster than 20 x r3.xlarge (30.5 GB) (especially if you have
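For reference, a minimal sketch of that kind of job (the edge-list path is hypothetical):

    import org.apache.spark.graphx.GraphLoader

    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
    val cc = graph.connectedComponents().vertices        // (vertexId, smallest id in its component)
    println(cc.map(_._2).distinct().count())             // number of connected components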

Re: Python unit tests - Unable to ru it with Python 2.6 or 2.7

2016-03-11 Thread Davies Liu
Spark 2.0 is dropping support for Python 2.6; it only works with Python 2.7 and 3.4+. On Thu, Mar 10, 2016 at 11:17 PM, Gayathri Murali wrote: > Hi all, > > I am trying to run python unit tests. > > I currently have Python 2.6 and 2.7 installed. I installed

Re: udf StructField to JSON String

2016-03-11 Thread Tristan Nixon
I have a similar situation in an app of mine. I implemented a custom ML Transformer that wraps the Jackson ObjectMapper - this gives you full control over how your custom entities / structs are serialized. > On Mar 11, 2016, at 11:53 AM, Caires Vinicius wrote: > > Hmm. I

spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
I've run into a situation where it would appear that foreachPartition is only running on one of my executors. I have a small cluster (2 executors with 8 cores each). When I run a job with a small file (with 16 partitions) I can see that the 16 partitions are initialized but they all appear to

Re: udf StructField to JSON String

2016-03-11 Thread Caires Vinicius
Hmm. I think my problem is a little more complex. I'm using https://github.com/databricks/spark-redshift and when I read from JSON file I got this schema. root |-- app: string (nullable = true) |-- ct: long (nullable = true) |-- event: struct (nullable = true) ||-- attributes: struct

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Nick Pentreath
Ok, I think I understand things better now. For Spark's current implementation, you would need to map those features as you mention. You could also use say StringIndexer -> OneHotEncoder or VectorIndexer. You could create a Pipeline to deal with the mapping and training (e.g.
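A hedged sketch of such a Pipeline, with hypothetical column names:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

    val indexer   = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
    val encoder   = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
    val assembler = new VectorAssembler().setInputCols(Array("categoryVec")).setOutputCol("features")
    val lr        = new LogisticRegression().setFeaturesCol("features").setLabelCol("label")
    val pipeline  = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
    val model     = pipeline.fit(trainingDf)   // trainingDf: DataFrame with "category" and "label" columns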

Re: How can I join two DataSet of same case class?

2016-03-11 Thread Jacek Laskowski
Hi, Use the names of the datasets, not $, i.e. a("edid"). Jacek 11.03.2016 6:09 AM "박주형" wrote: > Hi. I want to join two DataSets, but the below stderr is shown > > 16/03/11 13:55:51 WARN ColumnName: Constructing trivially true equals > predicate, ''edid = 'edid'. Perhaps

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Cody Koeninger
Why are you including a specific dependency on Kafka? Spark's external streaming kafka module already depends on kafka. Can you link to an actual repo with build file etc? On Fri, Mar 11, 2016 at 11:21 AM, Mukul Gupta wrote: > Please note that while building jar of

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Mukul Gupta
Please note that while building jar of code below, i used spark 1.6.0 + kafka 0.9.0.0 libraries I also tried spark 1.5.0 + kafka 0.9.0.1 combination, but encountered the same issue. I could not use the ideal combination spark 1.6.0 + kafka 0.9.0.1 (which matches with spark and kafka versions

Re: Get output of the ALS algorithm.

2016-03-11 Thread Jacek Laskowski
What about write.save(file)? P.S. I'm new to Spark MLlib. 11.03.2016 4:57 AM "Shishir Anshuman" wrote: > hello, > > I am new to Apache Spark and would like to get the Recommendation output > of the ALS algorithm in a file. > Please suggest me the solution. > >

Re: Doubt on data frame

2016-03-11 Thread Mich Talebzadeh
Temporary tables are created in temp file space within the session. Once the session is closed, the temporary table goes. scala> rs.registerTempTable("mytemp") And this is the temporary file created with the above command: drwx-- - hduser supergroup 0 2016-03-11 17:09

Re: Get output of the ALS algorithm.

2016-03-11 Thread Bryan Cutler
Are you trying to save predictions on a dataset to a file, or the model produced after training with ALS? On Thu, Mar 10, 2016 at 7:57 PM, Shishir Anshuman wrote: > hello, > > I am new to Apache Spark and would like to get the Recommendation output > of the ALS

Re: Does Spark support in-memory shuffling?

2016-03-11 Thread Ted Yu
Please take a look at SPARK-3376 and discussion on https://github.com/apache/spark/pull/5403 FYI On Fri, Mar 11, 2016 at 6:37 AM, Xudong Zheng wrote: > Hi all, > > Does Spark support in-memory shuffling now? If not, is there any > consideration for it? > > Thanks! > > -- >

Re: udf StructField to JSON String

2016-03-11 Thread Tristan Nixon
Have you looked at DataFrame.write.json( path )? https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter > On Mar 11, 2016, at 7:15 AM, Caires Vinicius wrote: > > I have one DataFrame with nested StructField and I want to convert to

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Daniel Siegmann
On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath wrote: > Would you mind letting us know the # training examples in the datasets? > Also, what do your features look like? Are they text, categorical etc? You > mention that most rows only have a few features, and all rows

Re: How can I join two DataSet of same case class?

2016-03-11 Thread Xinh Huynh
I think you have to use an alias. To provide an alias to a Dataset: val d1 = a.as("d1") val d2 = b.as("d2") Then join, using the alias in the column names: d1.joinWith(d2, $"d1.edid" === $"d2.edid") Finally, please double-check your column names. I did not see "edid" in your case class. Xinh

Re: Spark configuration with 5 nodes

2016-03-11 Thread Mich Talebzadeh
Hi Steve, My argument has always been that if one is going to use Solid State Disks (SSD), it makes sense to have them for the NN disks (start-up from fsimage etc.). Obviously the NN lives in memory. Would you also recommend RAID10 (mirroring & striping) for the NN disks? Thanks Dr Mich Talebzadeh

RE: Graphx

2016-03-11 Thread John Lilley
I suppose for a 2.6bn case we’d need Long: public class GenCCInput { public static void main(String[] args) { if (args.length != 2) { System.err.println("Usage: \njava GenCCInput "); System.exit(-1); } long edges = Long.parseLong(args[0]); long groupSize =

RE: Graphx

2016-03-11 Thread John Lilley
PS: This is the code I use to generate clustered test data: public class GenCCInput { public static void main(String[] args) { if (args.length != 2) { System.err.println("Usage: \njava GenCCInput "); System.exit(-1); } int edges = Integer.parseInt(args[0]); int

RE: Graphx

2016-03-11 Thread John Lilley
Ovidiu, IMHO, this is one of the biggest issues facing GraphX and Spark. There are a lot of knobs and levers to pull to affect performance, with very little guidance about which settings work in general. We cannot ship software that requires end-user tuning; it just has to work.

Re: Doubt on data frame

2016-03-11 Thread ram kumar
No, I am not aware of it. Can you provide me with the details regarding this? Thanks On Fri, Mar 11, 2016 at 8:25 PM, Ted Yu wrote: > temporary tables are associated with SessionState which is used > by SQLContext. > > Did you keep the session ? > > Cheers > > On Fri, Mar

Re: Graphx

2016-03-11 Thread Ovidiu-Cristian MARCU
Hi, I wonder what version of Spark and different parameter configuration you used. I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations) using 16 nodes with around 80GB RAM each (Spark 1.5, default parameters) John: I suppose your C++ app (algorithm) does not scale if you used

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Cody Koeninger
Can you post your actual code? On Thu, Mar 10, 2016 at 9:55 PM, Mukul Gupta wrote: > Hi All, I was running the following test: Setup: 9 VMs running spark workers > with 1 spark executor each. 1 VM running kafka and spark master. Spark > version is 1.6.0 Kafka version is

Re: Graphx

2016-03-11 Thread lihu
Hi, John: I am very interested in your experiment. How did you find that RDD serialization cost lots of time - from the log or some other tools? On Fri, Mar 11, 2016 at 8:46 PM, John Lilley wrote: > Andrew, > > > > We conducted some tests for using Graphx to solve

RE: Graphx

2016-03-11 Thread John Lilley
A colleague did the experiments and I don’t know exactly how he observed that. I think it was indirect: from the Spark diagnostics indicating the amount of I/O, he deduced that this was RDD serialization. Also, when he added light compression to RDD serialization this improved matters. John

Re: Doubt on data frame

2016-03-11 Thread Ted Yu
temporary tables are associated with SessionState which is used by SQLContext. Did you keep the session ? Cheers On Fri, Mar 11, 2016 at 5:02 AM, ram kumar wrote: > Hi, > > I registered a dataframe as a table using registerTempTable > and I didn't close the Spark

Does Spark support in-memory shuffling?

2016-03-11 Thread Xudong Zheng
Hi all, Does Spark support in-memory shuffling now? If not, is there any consideration for it? Thanks! -- Xudong Zheng

Re: Spark on YARN memory consumption

2016-03-11 Thread Jan Štěrba
Thanks, that explains a lot. -- Jan Sterba https://twitter.com/honzasterba | http://flickr.com/honzasterba | http://500px.com/honzasterba On Fri, Mar 11, 2016 at 2:36 PM, Silvio Fiorito wrote: > Hi Jan, > > > > Yes what you’re seeing is due to YARN container memory

RE: Spark on YARN memory consumption

2016-03-11 Thread Silvio Fiorito
Hi Jan, Yes what you’re seeing is due to YARN container memory overhead. Also, typically the memory increments for YARN containers is 1GB. This gives a good overview: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ Thanks, Silvio From: Jan
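A worked example of that overhead with hypothetical numbers (Spark 1.5 defaults: overhead = max(384 MB, 10% of executor memory)):

    val executorMemoryMb = 4096                                       // --executor-memory 4g
    val overheadMb = math.max(384, (0.10 * executorMemoryMb).toInt)   // 409 MB here
    val requestedMb = executorMemoryMb + overheadMb                   // 4505 MB asked of YARN
    // YARN rounds the request up to its allocation increment (often 1024 MB),
    // so a "4g" executor typically shows up as a 5 GB container in the RM UI.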

Spark on YARN memory consumption

2016-03-11 Thread Jan Štěrba
Hello, I am experimenting with tuning an on-demand spark cluster on top of our Cloudera Hadoop. I am running Cloudera 5.5.2 with Spark 1.5 right now and I am running spark in yarn-client mode. Right now my main experimentation is about the spark.executor.memory property and I have noticed a strange

udf StructField to JSON String

2016-03-11 Thread Caires Vinicius
I have one DataFrame with a nested StructField and I want to convert it to a JSON String. Is there any way to accomplish this?

Re: Can we use spark inside a web service?

2016-03-11 Thread Andrés Ivaldi
Nice discussion. I have a question about web services with Spark. What could be the problem with using akka-http as the web service (like Play does), with one SparkContext created, and the queries over akka-http using only that instance of the SparkContext? Also, about analytics, we are working on

Doubt on data frame

2016-03-11 Thread ram kumar
Hi, I registered a dataframe as a table using registerTempTable and I didn't close the Spark context. Will the table be available for a longer time? Thanks

RE: Graphx

2016-03-11 Thread John Lilley
Andrew, We conducted some tests for using Graphx to solve the connected-components problem and were disappointed. On 8 nodes of 16GB each, we could not get above 100M edges. On 8 nodes of 60GB each, we could not process 1bn edges. RDD serialization would take excessive time and then we

BlockFetchFailed Exception

2016-03-11 Thread Priya Ch
Hi All, I am trying to run spark k-means on a data set which is close to 1 GB. Most often I see a BlockFetchFailed exception, which I suspect is because of out-of-memory. Here are the configuration details: Total cores: 12, Total workers: 3, Memory per node: 6GB. When running the job, I am giving
