Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-05 Thread Rahul Bindlish
Tobias, Understood, and thanks for the quick resolution of the problem. Thanks ~Rahul -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/serialization-issue-in-case-of-case-class-is-more-than-1-tp20334p20446.html Sent from the Apache Spark User List mailing list

Profiling GraphX codes.

2014-12-05 Thread Deep Pradhan
Is there any tool to profile GraphX codes in a cluster? Is there a way to know the messages exchanged among the nodes in a cluster? WebUI does not give all the information. Thank You

R: Clarifications on Spark

2014-12-05 Thread Paolo Platter
Hi, 1) Yes, you can. Spark supports a lot of file formats on HDFS/S3, and it also supports Cassandra and JDBC in general. 2) Yes. Spark has a JDBC Thrift server where you can attach BI tools. I suggest you pay attention to your query response time requirements. 3) No, you can go with

Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode

2014-12-05 Thread LinQili
Hi, all: According to https://github.com/apache/spark/pull/2732, when a Spark job fails or exits nonzero in yarn-cluster mode, spark-submit will get the corresponding return code of the Spark job. But when I tried it on spark-1.1.1 in yarn-cluster mode, spark-submit returned zero anyway. Here is my spark

RE: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode

2014-12-05 Thread LinQili
I tried in spark client mode; spark-submit can get the correct return code from the Spark job. But in yarn-cluster mode, it failed. From: lin_q...@outlook.com To: u...@spark.incubator.apache.org Subject: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode Date: Fri, 5

Re: SQL query in scala API

2014-12-05 Thread Cheng Lian
Oh, sorry. So neither SQL nor Spark SQL is preferred. Then you may write your own aggregation with aggregateByKey: users.aggregateByKey((0, Set.empty[String]))({ case ((count, seen), user) => (count + 1, seen + user) }, { case ((count0, seen0), (count1, seen1)) => (count0 + count1, seen0 ++
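A hedged sketch of how that aggregation might look once completed (the message above is truncated; the variable names and the final mapValues step are assumptions):

  // Assumes `users` is an RDD[(K, String)] of (key, userId) pairs.
  val aggregated = users.aggregateByKey((0, Set.empty[String]))(
    { case ((count, seen), user) => (count + 1, seen + user) },                          // fold one value into a partition-local accumulator
    { case ((count0, seen0), (count1, seen1)) => (count0 + count1, seen0 ++ seen1) }     // merge accumulators across partitions
  )
  // e.g. keep the event count and the number of distinct users per key
  val result = aggregated.mapValues { case (count, seen) => (count, seen.size) }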

RE: Spark streaming for v1.1.1 - unable to start application

2014-12-05 Thread Shao, Saisai
Hi, I don't think it's a problem of Spark Streaming; judging from the call stack, the problem occurs when the BlockManager starts initializing itself. Would you mind checking your Spark configuration, hardware and deployment? Mostly I think it's not a problem of Spark. Thanks Saisai From:

RE: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode

2014-12-05 Thread LinQili
I tried another test code: def main(args: Array[String]) { if (args.length != 1) { Util.printLog(ERROR, "Args error - arg1: BASE_DIR"); exit(101) } val currentFile = args(0).toString; val DB = "test_spark"; val tableName = "src"; val sparkConf = new

Re: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode

2014-12-05 Thread Shixiong Zhu
What's the status of this application in the yarn web UI? Best Regards, Shixiong Zhu 2014-12-05 17:22 GMT+08:00 LinQili lin_q...@outlook.com: I tried another test code: def main(args: Array[String]) { if (args.length != 1) { Util.printLog(ERROR, "Args error - arg1: BASE_DIR")

Re: NullPointerException When Reading Avro Sequence Files

2014-12-05 Thread cjdc
Hi all, I've tried the above example on Gist, but it doesn't work (at least for me). Did anyone get this: 14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class

Increasing the number of retry in case of job failure

2014-12-05 Thread shahab
Hello, For some (unknown) reason some of my tasks that fetch data from Cassandra are failing quite often, and apparently the master removes a task which fails more than 4 times (in my case). Is there any way to increase the number of retries? best, /Shahab

[Graphx] which way is better to access faraway neighbors?

2014-12-05 Thread Yifan LI
Hi, I have a graph where each vertex keeps several messages for some faraway neighbours (I mean, not only immediate neighbours, but ones at most k hops away, e.g. k = 5). Now, I propose to distribute these messages to their corresponding destinations (say, "faraway neighbours"): - by using pregel

scala.MatchError on SparkSQL when creating ArrayType of StructType

2014-12-05 Thread Hao Ren
Hi, I am using SparkSQL on the 1.1.0 branch. The following code leads to a scala.MatchError at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247) val scm = StructType(inputRDD.schema.fields.init :+ StructField("list", ArrayType(StructType(

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread Jay Vyas
Here's an example of a Cassandra etl that you can follow which should exit on its own. I'm using it as a blueprint for revolving spark streaming apps on top of. For me, I kill the streaming app w system.exit after a sufficient amount of data is collected. That seems to work for most any

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-05 Thread sahanbull
It worked man.. Thanks a lot :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-sparkSQL-to-convert-a-collection-of-python-dictionary-of-dictionaries-to-schma-RDD-tp20228p20461.html Sent from the Apache Spark User List mailing list archive at

How can I compile only the core and streaming (so that I can get test utilities of streaming)?

2014-12-05 Thread Emre Sevinc
Hello, I'm currently developing a Spark Streaming application and trying to write my first unit test. I've used Java for this application, and I also need to use Java (and JUnit) for writing unit tests. I could not find any documentation that focuses on Spark Streaming unit testing, all I could

Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2014-12-05 Thread Daniel Darabos
Hi, Alexey, I'm getting the same error on startup with Spark 1.1.0. Everything works fine fortunately. The error is mentioned in the logs in https://issues.apache.org/jira/browse/SPARK-4498, so maybe it will also be fixed in Spark 1.2.0 and 1.1.2. I have no insight into it unfortunately. On Tue,

Re: Does filter on an RDD scan every data item ?

2014-12-05 Thread nsareen
Any thoughts, how could Spark SQL help in our scenario ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-tp20170p20465.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread Helena Edelson
You can just do something like this, the Spark Cassandra Connector handles the rest: KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, Map(KafkaTopicRaw -> 10), StorageLevel.DISK_ONLY_2) .map { case (_, line) => line.split(",") }
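A hedged sketch of the rest of that pipeline, assuming the connector's streaming import; the case class, keyspace and table names are placeholders:

  import com.datastax.spark.connector.streaming._    // adds saveToCassandra to DStreams

  case class RawEvent(key: String, value: String)    // hypothetical row type matching the target table

  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map(KafkaTopicRaw -> 10), StorageLevel.DISK_ONLY_2)
    .map { case (_, line) => line.split(",") }
    .map(fields => RawEvent(fields(0), fields(1)))
    .saveToCassandra("my_keyspace", "raw_events")     // keyspace and table are placeholders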

cartesian on pyspark not parallelised

2014-12-05 Thread Antony Mayi
Hi, using pyspark 1.1.0 on YARN 2.5.0. All operations run nicely in parallel - I can see multiple python processes spawned on each nodemanager - but for some reason when running cartesian there is only a single python process running on each node. the task is indicating thousands of partitions

Re: Increasing the number of retry in case of job failure

2014-12-05 Thread Daniel Darabos
It is controlled by spark.task.maxFailures. See http://spark.apache.org/docs/latest/configuration.html#scheduling. On Fri, Dec 5, 2014 at 11:02 AM, shahab shahab.mok...@gmail.com wrote: Hello, By some (unknown) reasons some of my tasks, that fetch data from Cassandra, are failing so often,
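A minimal sketch of raising that setting programmatically, assuming Spark 1.1 (it can equally be passed as --conf spark.task.maxFailures=8 on spark-submit):

  val conf = new SparkConf()
    .setAppName("cassandra-fetch")          // hypothetical app name
    .set("spark.task.maxFailures", "8")     // default is 4; see Andrew's note below about fixing the root cause first
  val sc = new SparkContext(conf)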

Re: Market Basket Analysis

2014-12-05 Thread Sean Owen
Generally I don't think frequent-item-set algorithms are that useful. They're simple and not probabilistic; they don't tell you what sets occurred unusually frequently. Usually people ask for frequent item set algos when they really mean they want to compute item similarity or make

spark streaming kafka best practices ?

2014-12-05 Thread david
hi, What is the best way to process a batch window in Spark Streaming: kafkaStream.foreachRDD(rdd => { rdd.collect().foreach(event => { // process the event process(event) }) }) Or kafkaStream.foreachRDD(rdd => { rdd.map(event => { //
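For comparison, a hedged sketch of the usual pattern: avoid collect(), which pulls the whole batch to the driver, and instead process each partition on the executors (process() is the poster's own placeholder):

  kafkaStream.foreachRDD { rdd =>
    rdd.foreachPartition { events =>
      // runs on the executors, one iterator per partition
      events.foreach(event => process(event))
    }
  }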

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-05 Thread Daniel Darabos
On Fri, Dec 5, 2014 at 7:12 AM, Tobias Pfeiffer t...@preferred.jp wrote: Rahul, On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish rahul.bindl...@nectechnologies.in wrote: I have done so, that's why spark is able to load objectfile [e.g. person_obj] and spark has maintained serialVersionUID

Re: How can I compile only the core and streaming (so that I can get test utilities of streaming)?

2014-12-05 Thread Ted Yu
Please specify '-DskipTests' on commandline. Cheers On Dec 5, 2014, at 3:52 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, I'm currently developing a Spark Streaming application and trying to write my first unit test. I've used Java for this application, and I also need use Java
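A hedged example of the resulting command, assuming Maven 3 and a branch-1.1 checkout, building only core and streaming plus whatever they depend on:

  mvn -pl core,streaming -am -DskipTests package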

subscribe me to the list

2014-12-05 Thread Wang, Ningjun (LNG-NPV)
I would like to subscribe to user@spark.apache.org Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541

Re: subscribe me to the list

2014-12-05 Thread 张鹏
Hi Ningjun Please send email to this address to get subscribed: user-subscr...@spark.apache.org On Dec 5, 2014, at 10:36 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I would like to subscribe to the user@spark.apache.org Regards, Ningjun Wang Consulting Software

Why my default partition size is set to 52 ?

2014-12-05 Thread Jaonary Rabarisoa
Hi all, I'm trying to run some spark job with spark-shell. What I want to do is just to count the number of lines in a file. I start the spark-shell with the default argument i.e just with ./bin/spark-shell. Load the text file with sc.textFile(path) and then call count on my data. When I do

Re: Why my default partition size is set to 52 ?

2014-12-05 Thread Sean Owen
How big is your file? it's probably of a size that the Hadoop InputFormat would make 52 splits for it. Data drives partitions, not processing resource. Really, 8 splits is the minimum parallelism you want. Several times your # of cores is better. On Fri, Dec 5, 2014 at 8:51 AM, Jaonary Rabarisoa
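A small sketch of asking for more splits explicitly (the path and the number 64 are placeholders):

  // request a minimum number of partitions when reading...
  val lines = sc.textFile("/path/to/file.txt", 64)
  // ...or repartition afterwards to several times the number of cores
  val repartitioned = lines.repartition(64)
  println(lines.partitions.size)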

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread m.sarosh
Hi Akhil, Vyas, Helena, Thank you for your suggestions. As Akhil suggested earlier, I have implemented the batch Duration in JavaStreamingContext and waitForTermination(Duration). The approach Helena suggested is Scala oriented. But the issue now is that I want to set Cassandra as my output.

Re: Why my default partition size is set to 52 ?

2014-12-05 Thread Jaonary Rabarisoa
Ok, I misunderstood the meaning of the partitions. In fact, my file is 1.7G, and with a smaller file I get a different number of partitions. Thanks for this clarification. On Fri, Dec 5, 2014 at 4:15 PM, Sean Owen so...@cloudera.com wrote: How big is your file? it's probably of a size that the

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread Helena Edelson
I think what you are looking for is something like: JavaRDD<Double> pricesRDD = javaFunctions(sc).cassandraTable("ks", "tab", mapColumnTo(Double.class)).select("price"); JavaRDD<Person> rdd = javaFunctions(sc).cassandraTable("ks", "people", mapRowTo(Person.class)); noted here:

Why KMeans with mllib is so slow ?

2014-12-05 Thread Jaonary Rabarisoa
Hi all, I'm trying to run clustering with the kmeans algorithm. The size of my data set is about 240k vectors of dimension 384. Solving the problem with the kmeans available in julia (kmeans++) http://clusteringjl.readthedocs.org/en/latest/kmeans.html takes about 8 minutes on a single core.

Re: How can I compile only the core and streaming (so that I can get test utilities of streaming)?

2014-12-05 Thread Emre Sevinc
Hello, Specifying '-DskipTests' on commandline worked, though I can't be sure whether first running 'sbt assembly' also contributed to the solution. (I've tried 'sbt assembly' because branch-1.1's README says to use sbt). Thanks for the answer. Kind regards, Emre Sevinç

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Sean Owen
Spark has much more overhead, since it's set up to distribute the computation. Julia isn't distributed, and so has no such overhead in a completely in-core implementation. You generally use Spark when you have a problem large enough to warrant distributing, or, your data already lives in a

pyspark exception catch

2014-12-05 Thread Igor Mazor
Hi, Is it possible to catch exceptions using pyspark so that in case of an error the program will not fail and exit? For example, if I am using (key, value) RDD functionality but the data doesn't actually have a (key, value) format, pyspark will throw an exception (like ValueError) that I am unable to catch.

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread m.sarosh
Thank you Helena, But I would like to explain my problem space: the output is supposed to be Cassandra. To achieve that, I have to use the spark-cassandra-connector APIs. So going with a bottom-up approach, to write to cassandra, I have to use: javaFunctions(JavaRDD object rdd,

Re: Market Basket Analysis

2014-12-05 Thread Rohit Pujari
This is a typical use case: people who buy electric razors also tend to buy batteries and shaving gel along with them. The goal is to build a model which will look through POS records and find which product categories have a higher likelihood of appearing together in a given transaction. What would

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Davies Liu
Could you post your script to reproduce the results (also how to generate the dataset)? That will help us to investigate it. On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hmm, here I use spark in local mode on my laptop with 8 cores. The data is on my local

Spark Streaming Reusing JDBC Connections

2014-12-05 Thread Asim Jalis
Is there a way I can have a JDBC connection open through a streaming job. I have a foreach which is running once per batch. However, I don’t want to open the connection for each batch but would rather have a persistent connection that I can reuse. How can I do this? Thanks. Asim

Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ashic Mahtab
Hi, Seems adding the cassandra connector and spark streaming together causes issues. I've added my build and code files. Running sbt compile gives weird errors like Seconds is not part of org.apache.spark.streaming and object Receiver is not a member of package org.apache.spark.streaming.receiver. If I
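A hedged build.sbt sketch of the combination being attempted; the artifact names are the usual ones, but the version numbers are assumptions and have to be kept mutually compatible, which is the most common cause of errors like these:

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"      % "1.1.0" % "provided",
    "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
    "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"   // version is a placeholder; match it to the Spark version
  )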

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Jaonary Rabarisoa
The code is really simple: object TestKMeans { def main(args: Array[String]) { val conf = new SparkConf() .setAppName("Test KMeans") .setMaster("local[8]") .set("spark.executor.memory", "8g") val sc = new SparkContext(conf) val numClusters = 500;

RE: Spark Streaming Reusing JDBC Connections

2014-12-05 Thread Ashic Mahtab
I've done this: 1. foreachPartition 2. Open connection. 3. foreach inside the partition. 4. close the connection. Slightly crufty, but works. Would love to see a better approach. Regards, Ashic. Date: Fri, 5 Dec 2014 12:32:24 -0500 Subject: Spark Streaming Reusing JDBC Connections From:
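A hedged sketch of that pattern in a streaming job (the JDBC URL and the writeRecord helper are hypothetical):

  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val conn = java.sql.DriverManager.getConnection(jdbcUrl)   // one connection per partition, opened on the executor
      try {
        records.foreach(r => writeRecord(conn, r))               // hypothetical helper that issues the inserts
      } finally {
        conn.close()
      }
    }
  }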

Optimized spark configuration

2014-12-05 Thread vdiwakar.malladi
Hi, Could anyone help with what would be a better / optimized configuration for driver memory, worker memory, degree of parallelism, etc., when we are running 1 master node (itself also acting as a slave node) and 1 slave node. Both are of 32 GB RAM with 4 cores. On this, I

I am having problems reading files in the 4GB range

2014-12-05 Thread Steve Lewis
I am using a custom hadoop input format which works well on smaller files but fails with a file at about 4GB size - the format is generating about 800 splits and all variables in my code are longs - Any suggestions? Is anyone reading files of this size? Exception in thread main

Re: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ted Yu
Can you try with maven ? diff --git a/streaming/pom.xml b/streaming/pom.xml index b8b8f2e..6cc8102 100644 --- a/streaming/pom.xml +++ b/streaming/pom.xml @@ -68,6 +68,11 @@ <artifactId>junit-interface</artifactId> <scope>test</scope> </dependency> + <dependency> +

Re: Spark streaming for v1.1.1 - unable to start application

2014-12-05 Thread Andrew Or
Hey Sourav, are you able to run a simple shuffle in a spark-shell? 2014-12-05 1:20 GMT-08:00 Shao, Saisai saisai.s...@intel.com: Hi, I don’t think it’s a problem of Spark Streaming, seeing for call stack, it’s the problem when BlockManager starting to initializing itself. Would you mind

Re: Unable to run applications on clusters on EC2

2014-12-05 Thread Andrew Or
Hey, the default port is 7077. Not sure if you actually meant to put 7070. As a rule of thumb, you can go to the Master web UI and copy and paste the URL at the top left corner. That almost always works unless your cluster has a weird proxy set up. 2014-12-04 14:26 GMT-08:00 Xingwei Yang

Re: spark-submit on YARN is slow

2014-12-05 Thread Andrew Or
Hey Tobias, As you suspect, the reason why it's slow is because the resource manager in YARN takes a while to grant resources. This is because YARN needs to first set up the application master container, and then this AM needs to request more containers for Spark executors. I think this accounts

Re: [Graphx] which way is better to access faraway neighbors?

2014-12-05 Thread Ankur Dave
At 2014-12-05 02:26:52 -0800, Yifan LI iamyifa...@gmail.com wrote: I have a graph in where each vertex keep several messages to some faraway neighbours(I mean, not to only immediate neighbours, at most k-hops far, e.g. k = 5). now, I propose to distribute these messages to their

Re: spark-submit on YARN is slow

2014-12-05 Thread Sandy Ryza
Hi Tobias, What version are you using? In some recent versions, we had a couple of large hardcoded sleeps on the Spark side. -Sandy On Fri, Dec 5, 2014 at 11:15 AM, Andrew Or and...@databricks.com wrote: Hey Tobias, As you suspect, the reason why it's slow is because the resource manager

Re: Monitoring Spark

2014-12-05 Thread Andrew Or
If you're only interested in a particular instant, a simpler way is to check the executors page on the Spark UI: http://spark.apache.org/docs/latest/monitoring.html. By default each executor runs one task per core, so you can see how many tasks are being run at a given time and this translates

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ashic Mahtab
Sorry... I really don't have enough maven know-how to do this quickly. I tried the pom below, and IntelliJ could find org.apache.spark.streaming.StreamingContext and org.apache.spark.streaming.Seconds, but not org.apache.spark.streaming.receiver.Receiver. Is there something specific I can try?

Re: Issue in executing Spark Application from Eclipse

2014-12-05 Thread Andrew Or
Hey Stuti, Did you start your standalone Master and Workers? You can do this through sbin/start-all.sh (see http://spark.apache.org/docs/latest/spark-standalone.html). Otherwise, I would recommend launching your application from the command line through bin/spark-submit. I am not sure if we

Java RDD Union

2014-12-05 Thread Ron Ayoub
I'm a bit confused regarding expected behavior of unions. I'm running on 8 cores. I have an RDD that is used to collect cluster associations (cluster id, content id, distance) for internal clusters as well as leaf clusters since I'm doing hierarchical k-means and need all distances for sorting

Re: Any ideas why a few tasks would stall

2014-12-05 Thread Andrew Or
Hi Steve et al., It is possible that there's just a lot of skew in your data, in which case repartitioning is a good idea. Depending on how large your input data is and how much skew you have, you may want to repartition to a larger number of partitions. By the way you can just call

Re: spark-submit on YARN is slow

2014-12-05 Thread Denny Lee
My submissions of Spark on YARN (CDH 5.2) resulted in a few thousand steps. If I was running this on standalone cluster mode the query finished in 55s but on YARN, the query was still running 30min later. Would the hard coded sleeps potentially be in play here? On Fri, Dec 5, 2014 at 11:23 Sandy

Re: Increasing the number of retry in case of job failure

2014-12-05 Thread Andrew Or
Increasing max failures is a way to do it, but it's probably a better idea to keep your tasks from failing in the first place. Are your tasks failing with exceptions from Spark or your application code? If from Spark, what is the stack trace? There might be a legitimate Spark bug such that even

Re: drop table if exists throws exception

2014-12-05 Thread Michael Armbrust
The command runs fine for me on master. Note that Hive does print an exception in the logs, but that exception does not propagate to user code. On Thu, Dec 4, 2014 at 11:31 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I got an exception saying Hive: NoSuchObjectException(message:table

Re: drop table if exists throws exception

2014-12-05 Thread Mark Hamstra
And that is no different from how Hive has worked for a long time. On Fri, Dec 5, 2014 at 11:42 AM, Michael Armbrust mich...@databricks.com wrote: The command run fine for me on master. Note that Hive does print an exception in the logs, but that exception does not propogate to user code.

Re: SchemaRDD partition on specific column values?

2014-12-05 Thread Michael Armbrust
It does not appear that the in-memory caching currently preserves the information about the partitioning of the data so this optimization will probably not work. On Thu, Dec 4, 2014 at 8:42 PM, nitin nitin2go...@gmail.com wrote: With some quick googling, I learnt that I can we can provide

Re: spark-submit on YARN is slow

2014-12-05 Thread Sameer Farooqui
Just an FYI - I can submit the SparkPi app to YARN in cluster mode on a 1-node m3.xlarge EC2 instance and the app finishes running successfully in about 40 seconds. I just figured the 30 - 40 sec run time was normal b/c of the submitting overhead that Andrew mentioned. Denny, you can

Re: spark-submit on YARN is slow

2014-12-05 Thread Sandy Ryza
Hi Denny, Those sleeps were only at startup, so if jobs are taking significantly longer on YARN, that should be a different problem. When you ran on YARN, did you use the --executor-cores, --executor-memory, and --num-executors arguments? When running against a standalone cluster, by default

Re: Java RDD Union

2014-12-05 Thread Sean Owen
No, RDDs are immutable. union() creates a new RDD, and does not modify an existing RDD. Maybe this obviates the question. I'm not sure what you mean about releasing from memory. If you want to repartition the unioned RDD, you repartition the result of union(), not anything else. On Fri, Dec 5,

Re: Java RDD Union

2014-12-05 Thread Sameer Farooqui
Hi Ron, Out of curiosity, why do you think that union is modifying an existing RDD in place? In general all transformations, including union, will create new RDDs, not modify old RDDs in place. Here's a quick test: scala> val firstRDD = sc.parallelize(1 to 5) firstRDD:
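A hedged spark-shell sketch along the lines of that test (the message above is truncated); the point is that union() yields a third RDD and leaves its inputs untouched:

  val firstRDD  = sc.parallelize(1 to 5)
  val secondRDD = sc.parallelize(6 to 10)
  val unioned   = firstRDD.union(secondRDD)   // a new RDD; firstRDD and secondRDD are unchanged
  println(firstRDD.count())                   // still 5
  println(unioned.count())                    // 10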

Re: spark-submit on YARN is slow

2014-12-05 Thread Arun Ahuja
Hey Sandy, What are those sleeps for and do they still exist? We have seen about a 1min to 1:30 executor startup time, which is a large chunk for jobs that run in ~10min. Thanks, Arun On Fri, Dec 5, 2014 at 3:20 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Denny, Those sleeps were only

Re: spark-submit on YARN is slow

2014-12-05 Thread Ashish Rangole
Likely this is not the case here, yet one thing to point out with YARN parameters like --num-executors is that they should be specified *before* the app jar and app args on the spark-submit command line; otherwise the app only gets the default number of containers, which is 2. On Dec 5, 2014 12:22 PM, Sandy
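A hedged illustration of that ordering (class name, jar and resource numbers are placeholders); everything after the application jar is handed to the application as its own arguments:

  spark-submit --master yarn-cluster \
    --num-executors 10 --executor-cores 4 --executor-memory 8g \
    --class com.example.MyApp \
    my-app.jar appArg1 appArg2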

Re: spark-submit on YARN is slow

2014-12-05 Thread Sandy Ryza
Hey Arun, The sleeps would only cause maximum like 5 second overhead. The idea was to give executors some time to register. On more recent versions, they were replaced with the spark.scheduler.minRegisteredResourcesRatio and spark.scheduler.maxRegisteredResourcesWaitingTime. As of 1.1, by

Re: Java RDD Union

2014-12-05 Thread Sean Owen
foreach also creates a new RDD, and does not modify an existing RDD. However, in practice, nothing stops you from fiddling with the Java objects inside an RDD when you get a reference to them in a method like this. This is definitely a bad idea, as there is certainly no guarantee that any other

R: Optimized spark configuration

2014-12-05 Thread Paolo Platter
What kind of query are you performing? You should set something like 2 partitions per core; that would be 400 MB per partition. As you have a lot of RAM I suggest caching the whole table; performance will increase a lot. Paolo Sent from my Windows Phone From:
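A hedged sketch of both suggestions, assuming Spark SQL 1.1 (the table name, path and partition count are placeholders):

  // cache the whole table once so that subsequent queries are served from memory
  sqlContext.cacheTable("my_table")
  val result = sqlContext.sql("SELECT count(*) FROM my_table")

  // aim for roughly 2 partitions per core (~400 MB each here) when creating the underlying RDD
  val data = sc.textFile("/path/to/data", 16)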

Re: scala.MatchError on SparkSQL when creating ArrayType of StructType

2014-12-05 Thread Michael Armbrust
All values in Hive are always nullable, though you should still not be seeing this error. It should be addressed by this patch: https://github.com/apache/spark/pull/3150 On Fri, Dec 5, 2014 at 2:36 AM, Hao Ren inv...@gmail.com wrote: Hi, I am using SparkSQL on 1.1.0 branch. The following

Re: Using data in RDD to specify HDFS directory to write to

2014-12-05 Thread Nathan Murthy
I'm experiencing the same problem when I try to run my app in a standalone Spark cluster. My use case, however, is closer to the problem documented in this thread: http://apache-spark-user-list.1001560.n3.nabble.com/Please-help-running-a-standalone-app-on-a-Spark-cluster-td1596.html. The

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread DB Tsai
Also, are you using the latest master in this experiment? A PR merged into master a couple of days ago will speed up k-means about three times. See https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1 Sincerely, DB Tsai

Re: Including data nucleus tools

2014-12-05 Thread DB Tsai
Can you try to run the same job using the assembly packaged by make-distribution as we discussed in the other thread. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Dec 5, 2014 at

Cannot PredictOnValues or PredictOn base on the model build with StreamingLinearRegressionWithSGD

2014-12-05 Thread Bui, Tri
Hi, The following example code is able to build the correct model.weights, but its prediction value is zero. Am I passing the PredictOnValues incorrectly? I also coded a batch version based on LinearRegressionWithSGD() with the same train and test data, iteration, and stepsize info, and it was

Running two different Spark jobs vs multi-threading RDDs

2014-12-05 Thread Corey Nolet
I've read in the documentation that RDDs can be run concurrently when submitted in separate threads. I'm curious how the scheduler would handle propagating these down to the tasks. I have 3 RDDs: - one RDD which loads some initial data, transforms it and caches it - two RDDs which use the cached

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Marcelo Vanzin
You can set SPARK_PREPEND_CLASSES=1 and it should pick your new mllib classes whenever you compile them. I don't see anything similar for examples/, so if you modify example code you need to re-build the examples module (package or install - just compile won't work, since you need to build the
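A hedged example of that workflow, assuming the commands are run from the Spark source root:

  export SPARK_PREPEND_CLASSES=1
  mvn -pl mllib compile      # recompile just the module you changed
  ./bin/spark-shell          # the launcher now puts the freshly compiled classes ahead of the assembly jar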

Re: Market Basket Analysis

2014-12-05 Thread Debasish Das
Apriori can be thought of as post-processing on a product similarity graph... I call it product similarity, but for each product you build a node which keeps the distinct users visiting the product, and two product nodes are connected by an edge if the intersection > 0... you are assuming if no one user

Transfer from RDD to JavaRDD

2014-12-05 Thread Xingwei Yang
I use Spark in Java. I want to access the vectors of RowMatrix M, thus I use M.rows(), which is an RDD<Vector>. I want to transform it to a JavaRDD<Vector>, so I used the following command: JavaRDD<Vector> data = JavaRDD.fromRDD(M.rows(), scala.reflect.ClassTag$.MODULE$.apply(Vector.class)); However, it

Re: Stateful mapPartitions

2014-12-05 Thread Patrick Wendell
Yeah the main way to do this would be to have your own static cache of connections. These could be using an object in Scala or just a static variable in Java (for instance a set of connections that you can borrow from). - Patrick On Thu, Dec 4, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp
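A hedged Scala sketch of that static-cache idea (the pool type and helper names are hypothetical); because a Scala object is initialized once per executor JVM, the connections survive across batches and can be borrowed inside foreachPartition instead of being opened and closed each time:

  object ConnectionPool {
    // one queue of open connections per executor JVM, created lazily on first use
    private lazy val pool = new java.util.concurrent.ConcurrentLinkedQueue[java.sql.Connection]()

    def borrow(url: String): java.sql.Connection =
      Option(pool.poll()).getOrElse(java.sql.DriverManager.getConnection(url))

    def giveBack(c: java.sql.Connection): Unit = pool.add(c)   // return instead of closing, so the next batch reuses it
  }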

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Koert Kuipers
I suddenly also ran into the issue that maven is trying to download snapshots that don't exist for the other sub-projects. Did something change in the maven build? Does maven not have the capability to smartly compile the other sub-projects that a sub-project depends on? I'd rather avoid mvn install

Re: drop table if exists throws exception

2014-12-05 Thread Jianshi Huang
I see. The resulting SchemaRDD is returned, so like Michael said, the exception does not propagate to user code. However, printing out the following log is confusing :) scala> sql("drop table if exists abc") 14/12/05 16:27:02 INFO ParseDriver: Parsing command: drop table if exists abc 14/12/05

Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-05 Thread Jianshi Huang
Hi, I had to use Pig for some preprocessing and to generate Parquet files for Spark to consume. However, due to Pig's limitation, the generated schema contains Pig's identifiers, e.g. sorted::id, sorted::cre_ts, ... I tried to put the schema inside CREATE EXTERNAL TABLE, e.g. create external

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Koert Kuipers
I think what changed is that core now has dependencies on other sub-projects. OK... so I am forced to install stuff because maven cannot compile what is needed. I will install. On Fri, Dec 5, 2014 at 7:12 PM, Koert Kuipers ko...@tresata.com wrote: i suddenly also run into the issue that maven is

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Sean Owen
Maven definitely compiles what is needed, but not if you tell it to only compile one module alone. Unless you have previously built and installed the other local snapshot artifacts it needs, that invocation can't proceed because you have restricted it to build one module whose dependencies don't

Re: Transfer from RDD to JavaRDD

2014-12-05 Thread Sean Owen
You can probably get around it with casting, but I ended up using wrapRDD -- which is not a static method -- from another JavaRDD in scope to address this more directly without casting or warnings. It's not ideal but both should work, just a matter of which you think is less hacky. On Fri, Dec 5,

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Marcelo Vanzin
I've never used it, but reading the help it seems the -am option might help here. On Fri, Dec 5, 2014 at 4:47 PM, Sean Owen so...@cloudera.com wrote: Maven definitely compiles what is needed, but not if you tell it to only compile one module alone. Unless you have previously built and

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Ted Yu
I tried the following: rm -rf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.3.0-SNAPSHOT/ and then mvn -am -pl streaming package -DskipTests [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [4.976s] [INFO] Spark Project Networking

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-05 Thread Jianshi Huang
Here's the solution I got after talking with Liancheng: 1) using backquotes `..` to wrap up all illegal characters: val rdd = parquetFile(file) val schema = rdd.schema.fields.map(f => s"`${f.name}` ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n") val ddl_13 = s

Problems creating and reading a large test file

2014-12-05 Thread Steve Lewis
I am trying to look at problems reading a data file over 4G. In my testing I am trying to create such a file. My plan is to create a fasta file (a simple format used in biology) looking like >1 TCCTTACGGAGTTCGGGTGTTTATCTTACTTATCGCGGTTCGCTGCCGCTCCGGGAGCCCGGATAGGCTGCGTTAATACCTAAGGAGCGCGTATTG >2

Re: rdd.saveAsTextFile problem

2014-12-05 Thread dylanhogg
Try the workaround for Windows found here: http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7. This fixed the issue when calling rdd.saveAsTextFile(..) for me with Spark v1.1.0 on Windows 8.1 in local mode. Summary of steps: 1) download the compiled winutils.exe from

Re: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode

2014-12-05 Thread Shixiong Zhu
There were two exits in this code. If the args were wrong, spark-submit will get the return code 101, but if the args are correct, spark-submit cannot get the second return code 100. What's the difference between these two exits? I was so confused. I'm also confused. When I tried your code,

Trying to understand a basic difference between these two configurations

2014-12-05 Thread Soumya Simanta
I'm trying to understand the conceptual difference between these two configurations in term of performance (using Spark standalone cluster) Case 1: 1 Node 60 cores 240G of memory 50G of data on local file system Case 2: 6 Nodes 10 cores per node 40G of memory per node 50G of data on HDFS nodes

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-05 Thread Imran Rashid
It's an easy mistake to make... I wonder if an assertion could be implemented that makes sure the type parameter is present. We could use the NotNothing pattern http://blog.evilmonkeylabs.com/2012/05/31/Forcing_Compiler_Nothing_checks/ but I wonder if it would just make the method signature
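A hedged sketch of the NotNothing trick from the linked post (adapted; the names are illustrative): two competing implicits for Nothing make implicit search ambiguous exactly when the caller forgets the type parameter, so the call fails to compile instead of silently inferring Nothing.

  sealed trait NotNothing[T]
  object NotNothing {
    implicit def evidence[T]: NotNothing[T] = new NotNothing[T] {}
    // Two conflicting values for Nothing make resolution ambiguous when T is inferred as Nothing.
    implicit val ambiguous1: NotNothing[Nothing] = new NotNothing[Nothing] {}
    implicit val ambiguous2: NotNothing[Nothing] = new NotNothing[Nothing] {}
  }

  // A hypothetical wrapper around sc.objectFile that refuses to compile without an explicit type parameter.
  def typedObjectFile[T: scala.reflect.ClassTag: NotNothing](sc: org.apache.spark.SparkContext, path: String) =
    sc.objectFile[T](path)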

Re: Trying to understand a basic difference between these two configurations

2014-12-05 Thread Tathagata Das
That depends! See inline. I am assuming that when you said replacing local disk with HDFS in case 1, you are connected to a separate HDFS cluster (like case 1) with a single 10G link. Also assuming that all nodes (1 in case 1, and 6 in case 2) are the worker nodes, and the spark application

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ashic Mahtab
Getting this on the home machine as well. Not referencing the spark cassandra connector in libraryDependencies compiles. I've recently updated IntelliJ to 14. Could that be causing an issue? From: as...@live.com To: yuzhih...@gmail.com CC: user@spark.apache.org Subject: RE: Adding Spark

Re: spark-submit on YARN is slow

2014-12-05 Thread Denny Lee
Sorry for the delay in my response - for my spark calls for stand-alone and YARN, I am using the --executor-memory and --total-executor-cores for the submission. In standalone, my baseline query completes in ~40s while in YARN, it completes in ~1800s. It does not appear from the RM web UI that

Re: spark-submit on YARN is slow

2014-12-05 Thread Denny Lee
Okay, my bad for not testing out the documented arguments - once I use the correct ones, the query completes in ~55s (I can probably make it faster). Thanks for the help, eh?! On Fri Dec 05 2014 at 10:34:50 PM Denny Lee denny.g@gmail.com wrote: Sorry for the delay in my response

Fair scheduling accross applications in stand-alone mode

2014-12-05 Thread Mohammed Guller
Hi - I understand that one can use spark.deploy.defaultCores and spark.cores.max to assign a fixed number of worker cores to different apps. However, instead of statically assigning the cores, I would like Spark to dynamically assign the cores to multiple apps. For example, when there is a