Spark streaming update/restart gracefully

2014-10-27 Thread Jianshi Huang
Looks like the current solution to update Spark Streaming jars/configurations is to 1) save the current Kafka offsets somewhere (say ZooKeeper), 2) shut down the cluster and restart, 3) connect to Kafka with the previously saved offsets. Assuming we're reading from Kafka which provides nice persistence and

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
Hi Jianshi, For simulation purposes, I think you can try ConstantInputDStream and QueueInputDStream to convert one RDD or a series of RDDs into a DStream: the first one outputs the same RDD in each batch duration, and the second one outputs one RDD from a queue in each batch duration. You can take a
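
For reference, a minimal Scala sketch of both options (not from the original thread; the StreamingContext, batch interval and sample data are illustrative):

```scala
import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

// assumes an existing SparkContext `sc`
val ssc = new StreamingContext(sc, Seconds(1))
val data: RDD[String] = sc.parallelize(Seq("a", "b", "c"))

// Option 1: emit the same RDD in every batch interval
val constantStream = new ConstantInputDStream(ssc, data)

// Option 2: emit one queued RDD per batch interval
val rddQueue = Queue(data, sc.parallelize(Seq("d", "e")))
val queuedStream = ssc.queueStream(rddQueue, oneAtATime = true)
```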

Re: Spark SQL Exists Clause

2014-10-27 Thread Nicholas Chammas
I believe that's correct. See: http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features On Monday, October 27, 2014, agg212alexander_galaka...@brown.edu wrote: Hey, I'm trying to run TPC-H Query 4 (shown below), and get the following error: Exception in thread main

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
Hi Saisai, I understand it's non-trivial, but the requirement of simulating offline data as a stream is also fair. :) I just wrote a prototype; however, I need to do a collect and a bunch of parallelize calls... // RDD of (timestamp, value) def rddToDStream[T: ClassTag](data: RDD[(Long, T)],

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
I think your solution may not scale if the data size keeps increasing, since you have to collect all your data back to the driver node, so driver memory usage will be a problem. Why not filter out a specific time range of data as an RDD; after filtering over the whole time range, you will get a

Re: Accumulators : Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-10-27 Thread Akhil Das
It works fine on my *Spark 1.1.0* Thanks Best Regards On Mon, Oct 27, 2014 at 12:22 AM, octavian.ganea octavian.ga...@inf.ethz.ch wrote: Hi Akhil, Please see this related message. http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-td17263.html I am curious if this

Re: Spark optimization

2014-10-27 Thread Akhil Das
There is no tool to tweak a spark cluster, but while writing the job, you can consider the Tuning guidelines http://spark.apache.org/docs/latest/tuning.html. Thanks Best Regards On Mon, Oct 27, 2014 at 3:14 AM, Morbious knowledgefromgro...@gmail.com wrote: I wonder if there is any tool to

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
You're absolutely right, it's not 'scalable' as I'm using collect(). However, it's important to have the RDDs ordered by the timestamp of the time window (groupBy puts data into the corresponding time window). It's fairly easy to do in Pig, but somehow I have no idea how to express it in RDD...

Re: Spark SQL configuration

2014-10-27 Thread Akhil Das
You will face problems if the Spark version isn't compatible with your Hadoop version. (Let's say you have Hadoop 2.x and you downloaded Spark pre-compiled against Hadoop 1.x; then it would be a problem.) Of course you can use Spark without specifying any Hadoop configuration unless you are trying

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
Ok, back to Scala code, I'm wondering why I cannot do this: data.groupBy(timestamp / window) .sortByKey() // no sort method available here .map(sc.parallelize(_._2.sortBy(_._1))) // nested RDD, hmm... .collect() // returns Seq[RDD[(Timestamp, T)]] Jianshi On Mon, Oct 27, 2014 at 3:24

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
I think what you want is to make each bucket a new RDD, as in the Pig syntax you mentioned: gs = ORDER g BY group ASC, g.timestamp ASC // 'group' is the rounded timestamp for each bucket. From my understanding, there is currently no such API in Spark to achieve this; maybe you have

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
Yeah, you're absolutely right Saisai. My point is we should allow this kind of logic in RDD, let's say transforming type RDD[(Key, Iterable[T])] to Seq[(Key, RDD[T])]. Make sense? Jianshi On Mon, Oct 27, 2014 at 3:56 PM, Shao, Saisai saisai.s...@intel.com wrote: I think what you want is to

Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Jianshi Huang
There's an annoying small usability issue in HiveContext. By default, it creates a local metastore, which prevents other processes using HiveContext from being launched from the same directory. How can I make the metastore local to each HiveContext? Is there an in-memory metastore configuration?

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
Yes, I understand what you want, but it may be hard to achieve without collecting back to the driver node. Besides, can we think of another way to do it? Thanks Jerry From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Monday, October 27, 2014 4:07 PM To: Shao, Saisai Cc:

Re: SparkSQL display wrong result

2014-10-27 Thread Cheng Lian
Would you mind sharing the DDLs of all involved tables? What format are these tables stored in? Is this issue specific to this query? I guess Hive, Shark and Spark SQL all read from the same HDFS dataset? On 10/27/14 3:45 PM, lyf刘钰帆 wrote: Hi, I am using SparkSQL 1.1.0 with cdh 4.6.0

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
Sure, let's still focus on the streaming simulation use case. It's a very useful problem to solve. If we're going to use the same Spark Streaming core for the simulation, the simplest way is to have globally sorted RDDs and use ssc.queueStream. Thus collecting the key part to the driver is
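
A rough sketch of that simulation idea, combining the per-window filtering suggested earlier in the thread with ssc.queueStream; only the distinct window keys are collected to the driver, and the names, timestamps and window size below are illustrative:

```scala
import scala.collection.mutable.Queue
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// assumes an existing SparkContext `sc`; offline data of (timestampMillis, value)
val data: RDD[(Long, String)] = sc.parallelize(Seq(
  (1414400000000L, "a"), (1414400061000L, "b"), (1414400122000L, "c")))
val windowMs = 60 * 1000L

// key each record by its time window and keep the keyed RDD around
val keyed = data.map { case (ts, v) => (ts / windowMs, v) }.cache()

// collect only the (small) set of distinct window keys, not the data itself
val windows = keyed.keys.distinct().collect().sorted

// one RDD per window, in timestamp order, fed to queueStream
val perWindow = windows.map(w => keyed.filter(_._1 == w).values)
val ssc = new StreamingContext(sc, Seconds(1))
val simulated = ssc.queueStream(Queue(perWindow: _*), oneAtATime = true)
```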

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-27 Thread guoxu1231
Hi Stephen, I tried it again. To avoid the profile impact, I executed mvn -DskipTests clean package with Hadoop 1.0.4 by default, opened IDEA and imported it as a Maven project, and I didn't choose any profile in the import wizard. Then I ran Make Project or Rebuild Project in IDEA; unfortunately

Re: Which is better? One spark app listening to 10 topics vs. 10 spark apps each listening to 1 topic

2014-10-27 Thread Jianshi Huang
Any suggestion? :) Jianshi On Thu, Oct 23, 2014 at 3:49 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: The Kafka stream has 10 topics and the data rate is quite high (~ 100K/s per topic). Which configuration do you recommend? - 1 Spark app consuming all Kafka topics - 10 separate Spark

Re: Spark can't find jars

2014-10-27 Thread twinkle sachdeva
Hi, Try running the following in the Spark folder: bin/run-example SparkPi 10. If this runs fine, just see the set of arguments being passed via this script, and try in a similar way. Thanks, On Thu, Oct 16, 2014 at 2:59 PM, Christophe Préaud christophe.pre...@kelkoo.com wrote: Hi, I have

Sort-based shuffle did not work as expected

2014-10-27 Thread su...@certusnet.com.cn
Hi, all We expected to utilize sort-based shuffle in our Spark application but encountered unhandled problems. It seems that the data file and index file are not in a consistent state, and we got wrong result sets when trying to use Spark to bulk load data into HBase. There are many

RE: Sort-based shuffle did not work as expected

2014-10-27 Thread Shao, Saisai
Hi, Probably the problem you met is related to this JIRA ticket (https://issues.apache.org/jira/browse/SPARK-3948). It's potentially a kernel 2.6.32 bug which can make sort-based shuffle fail. I'm not sure your problem is the same as this one; would you mind checking your kernel version?

NoSuchMethodError: cassandra.thrift.ITransportFactory.openTransport()

2014-10-27 Thread Sasi
Apache Spark Team, We recently started experimenting with Apache Spark for high speed data retrieval. We downloaded the Apache Spark source code; installed Git; packaged the assembly; installed Scala and ran some examples mentioned in the documentation. We did all these steps on Windows XP. Till this

Re: 答复: SparkSQL display wrong result

2014-10-27 Thread Cheng Lian
Hm, these DDLs look quite normal. Forgot to ask: what version of Spark SQL are you using? Can you reproduce this issue with the master branch? Also, a small sample of input data that can reproduce this issue would be very helpful. For example, you can run SELECT * FROM tbl TABLESAMPLE (10 ROWS)

Re: PySpark problem with textblob from NLTK used in map

2014-10-27 Thread jan.zikes
So the problem was that Spark has internally set home to /home. A hack to make this work with Python is to add, before the call to textblob, the line: os.environ['HOME'] = '/home/hadoop'. Maybe I'll add one more question. I think that the

Fwd: [akka-user] Akka Camel plus Spark Streaming

2014-10-27 Thread Patrick McGloin
Hi Spark Users, I am trying to use the Akka Camel library together with Spark Streaming and it is not working when I deploy my Spark application to the Spark Cluster. It does work when I run the application locally so this seems to be an issue with how Spark loads the reference.conf file from the

Re: Which is better? One spark app listening to 10 topics vs. 10 spark apps each listening to 1 topic

2014-10-27 Thread Alec Ten Harmsel
On 10/27/2014 05:19 AM, Jianshi Huang wrote: Any suggestion? :) Jianshi On Thu, Oct 23, 2014 at 3:49 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: The Kafka stream has 10 topics and the data rate is quite high (~ 100K/s per topic). Which

Re: [akka-user] Akka Camel plus Spark Streaming

2014-10-27 Thread Patrick McGloin
Apologies, was not finished typing. 14/10/23 09:31:30 ERROR OneForOneStrategy: No configuration setting found for key 'akka.camel' akka.actor.ActorInitializationException: exception during creation This error means that the Akka Camel reference.conf is not being loaded, even though it is in the

RE: Which is better? One spark app listening to 10 topics vs. 10 spark apps each listening to 1 topic

2014-10-27 Thread Ashic Mahtab
I'm quite interested in this as well. I remember something about a streaming context needing one core. If that's the case, then won't 10 apps require 10 cores? Seems like a waste unless each topic is quite resource hungry? Would love to hear from the experts :) Date: Mon, 27 Oct 2014 06:35:29

Re: Python code crashing on ReduceByKey if I return custom class object

2014-10-27 Thread Gen
Hi, I think the problem is caused by data serialization. You can follow the link https://spark.apache.org/docs/latest/tuning.html to register your class testing. For pyspark 1.1.0, there is a problem with the default serializer.

Missing java.util.Date class error while running Spark job

2014-10-27 Thread Saket Kumar
Hello all, I am trying to run a Spark job but while running the job I am getting missing class errors. First I got a missing class error for Joda DateTime, I added the dependency in my build.sbt and rebuilt the assembly jar and tried again. This time I am getting the same error for

Re: spark sql query optimization , and decision tree building

2014-10-27 Thread Yanbo Liang
If you want to calculate mean, variance, minimum, maximum and total count for each column, especially for machine learning features, you can try MultivariateOnlineSummarizer. MultivariateOnlineSummarizer implements a numerically stable algorithm to compute sample mean and variance by column in
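
A small sketch of that API in use (Statistics.colStats wraps MultivariateOnlineSummarizer; the input vectors here are made up):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

// assumes an existing SparkContext `sc`; each Vector is one observation (row)
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)))

val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean)      // per-column mean
println(summary.variance)  // per-column variance
println(summary.min)       // per-column minimum
println(summary.max)       // per-column maximum
println(summary.count)     // number of rows
```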

Re: NoSuchMethodError: cassandra.thrift.ITransportFactory.openTransport()

2014-10-27 Thread Akhil Das
This can happen due to a jar conflict, meaning one of the jars (possibly the first one, apache-cassandra-thrift*.jar) has a class named *ITransportFactory* but it doesn't contain the method *openTransport()*. You can make sure the jar has the *ITransportFactory* class in it and it also

OutOfMemory in cogroup

2014-10-27 Thread Shixiong Zhu
We encountered some special OOM cases of cogroup when the data in one partition is not balanced. 1. The estimated size of used memory is inaccurate. For example, there are too many values for some special keys. Because SizeEstimator.visitArray only samples at most 100 cells for an array, it may

How to avoid use snappy compression when saveAsSequenceFile?

2014-10-27 Thread buring
Hi: After updating Spark to version 1.1.0, I experienced a snappy error which was posted here http://apache-spark-user-list.1001560.n3.nabble.com/Update-gcc-version-Still-snappy-error-tt15137.html . I avoided this problem with
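
The original message is truncated, so it is not clear which layer fails, but a sketch of the two knobs usually involved (the application name, path and sample data are placeholders):

```scala
import org.apache.hadoop.io.compress.DefaultCodec
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Knob 1: Spark's internal compression codec (snappy by default in 1.1),
// used for shuffle and broadcast data.
val conf = new SparkConf().setAppName("no-snappy")
  .set("spark.io.compression.codec", "lzf")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(Seq(("k1", 1), ("k2", 2)))

// Knob 2: the Hadoop output codec used by saveAsSequenceFile;
// pass an explicit non-snappy codec instead of relying on cluster defaults.
rdd.saveAsSequenceFile("/tmp/seq-output", Some(classOf[DefaultCodec]))
```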

Re: NoSuchMethodError: cassandra.thrift.ITransportFactory.openTransport()

2014-10-27 Thread Helena Edelson
Hi Sasi, Thrift is not needed to integrate Cassandra with Spark. In fact the only dep you need is spark-cassandra-connector_2.10-1.1.0-alpha3.jar, and you can upgrade to alpha4; we’re publishing beta very soon. For future reference, questions/tickets can be created

Re: Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Cheng Lian
I have never tried this yet, but maybe you can use an in-memory Derby database as the metastore: https://db.apache.org/derby/docs/10.7/devguide/cdevdvlpinmemdb.html I'll investigate this when I'm free; I guess we can use this for Spark SQL Hive support testing. On 10/27/14 4:38 PM, Jianshi Huang
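
An untested sketch of what that could look like; the property name follows the Hive/Derby documentation, and whether setting it programmatically (rather than in hive-site.xml) takes effect before the metastore initializes may depend on the Spark version:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("ephemeral-metastore"))
val hive = new HiveContext(sc)

// Point the embedded metastore at an in-memory Derby database so that
// HiveContexts started from the same directory don't fight over ./metastore_db.
hive.setConf("javax.jdo.option.ConnectionURL",
  "jdbc:derby:memory:metastore;create=true")
```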

Re: Spark as Relational Database

2014-10-27 Thread Peter Wolf
I agree. I'd like to avoid SQL. If I could store everything in Cassandra or Mongo and process it in Spark, that would be far preferable to creating a temporary working set. I'd like to write a performance test. Let's say I have two large collections A and B. Each collection has 2 columns and many

Re: Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Ted Yu
Please see https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-EmbeddedMetastore Cheers On Oct 27, 2014, at 6:20 AM, Cheng Lian lian.cs@gmail.com wrote: I have never tried this yet, but maybe you can use an in-memory Derby database as

Unsubscribe

2014-10-27 Thread Ian Ferreira
unsubscribe

Re: Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Cheng Lian
Thanks Ted, this is exactly what Spark SQL LocalHiveContext does. To make an embedded metastore local to a single HiveContext, we must allocate different Derby database directories for each HiveContext, and Jianshi is also trying to avoid that. On 10/27/14 9:44 PM, Ted Yu wrote: Please see

Re: Unsubscribe

2014-10-27 Thread Ted Yu
Take a look at the first section of: http://spark.apache.org/community Cheers On Mon, Oct 27, 2014 at 6:50 AM, Ian Ferreira ianferre...@hotmail.com wrote: unsubscribe

Re: Unsubscribe

2014-10-27 Thread Akhil Das
Send an email to user-unsubscr...@spark.apache.org to unsubscribe. Thanks Best Regards On Mon, Oct 27, 2014 at 7:41 PM, Ted Yu yuzhih...@gmail.com wrote: Take a look at the first section of: http://spark.apache.org/community Cheers On Mon, Oct 27, 2014 at 6:50 AM, Ian Ferreira

How to set JAVA_HOME with --deploy-mode cluster

2014-10-27 Thread Thomas Risberg
Hi, I'm trying to run the SparkPi example using `--deploy-mode cluster`. Not having much luck. Looks like the JAVA_HOME on my Mac is used for launching the app on the Spark cluster running on CentOS. How can I set the JAVA_HOME or better yet, how can I configure Spark to use the JAVA_HOME

Re: How to set JAVA_HOME with --deploy-mode cluster

2014-10-27 Thread Akhil Das
You can run the *which java* command to see where exactly your java is installed and export that as JAVA_HOME. Thanks Best Regards On Mon, Oct 27, 2014 at 7:56 PM, Thomas Risberg trisb...@pivotal.io wrote: Hi, I'm trying to run the SparkPi example using

Re: Spark LIBLINEAR

2014-10-27 Thread Debasish Das
Hi Professor Lin, It will be great if you could please review the TRON code in breeze and whether it is similar to the original TRON implementation... Breeze is already integrated in mllib (we are using BFGS, and OWLQN is in the works for mllib LogisticRegression) and comparing with TRON should be

Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread nitinkak001
I am working on running the following hive query from spark. /SELECT * FROM spark_poc.table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY IP_ADDRESS, COOKIE_ID/ Ran into /java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;/ (complete

Problem to run spark as standalone

2014-10-27 Thread java8964
Hi, Spark Users: I tried to test Spark on a standalone box, but faced an issue whose root cause I don't know. I basically followed exactly the document for deploying Spark in a standalone environment. 1) I checked out the Spark source code of release 1.1.0 2) I built Spark with the following

What this exception means? ConnectionManager: key already cancelled ?

2014-10-27 Thread shahab
Hi, I have a stand alone Spark Cluster, where worker and master reside on the same machine. I submit a job to the cluster, the job is executed for a while and suddenly I get this exception with no additional trace. ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@2490dce9

deploying a model built in mllib

2014-10-27 Thread chirag lakhani
Hello, I have been prototyping a text classification model that my company would like to eventually put into production. Our technology stack is currently Java based but we would like to be able to build our models in Spark/MLlib and then export something like a PMML file which can be used for

RE: Problem to run spark as standalone

2014-10-27 Thread java8964
I did a little more research about this. It looks like the worker started successfully, but on port 40294. This is shown in both the log and the master web UI. The question is that in the log, the master akka.tcp is trying to connect to a different port (44017). Why? Yong From:

Re: Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Jianshi Huang
Thanks Ted and Cheng for the in-memory Derby solution. I'll check it out. :) And to me, using in-mem by default makes sense; if a user wants a shared metastore, it needs to be specified. An 'embedded' local metastore in the working directory barely has a use case. Jianshi On Mon, Oct 27, 2014

Re: What this exception means? ConnectionManager: key already cancelled ?

2014-10-27 Thread Akhil Das
Setting the following while creating the SparkContext will sort it out. .set("spark.core.connection.ack.wait.timeout", "600") .set("spark.akka.frameSize", "50") On 27 Oct 2014 21:15, shahab shahab.mok...@gmail.com wrote: Hi, I have a stand alone Spark Cluster, where worker and master
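
In context, the suggestion amounts to something like the following sketch (the application name is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.core.connection.ack.wait.timeout", "600")  // seconds to wait for an ack
  .set("spark.akka.frameSize", "50")                     // max Akka message size in MB
val sc = new SparkContext(conf)
```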

Re: OutOfMemory in cogroup

2014-10-27 Thread Holden Karau
On Monday, October 27, 2014, Shixiong Zhu zsxw...@gmail.com wrote: We encountered some special OOM cases of cogroup when the data in one partition is not balanced. 1. The estimated size of used memory is inaccurate. For example, there are too many values for some special keys. Because

Re: Bug in Accumulators...

2014-10-27 Thread octavian.ganea
I am also using spark 1.1.0 and I ran it on a cluster of nodes (it works if I run it in local mode! ) If I put the accumulator inside the for loop, everything will work fine. I guess the bug is that an accumulator can be applied to JUST one RDD. Still another undocumented 'feature' of Spark

Re: How to avoid use snappy compression when saveAsSequenceFile?

2014-10-27 Thread Holden Karau
Can you post the error message you get when trying to save the sequence file? If you call first() on the RDD does it result in the same error? On Mon, Oct 27, 2014 at 6:13 AM, buring qyqb...@gmail.com wrote: Hi: After update spark to version1.1.0, I experienced a snappy error which

Re: scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0

2014-10-27 Thread Marius Soutier
So, apparently `wholeTextFiles` runs the job again, passing null as argument list, which in turn blows up my argument parsing mechanics. I never thought I had to check for null again in a pure Scala environment ;) On 26.10.2014, at 11:57, Marius Soutier mps@gmail.com wrote: I tried that

Re: deploying a model built in mllib

2014-10-27 Thread Xiangrui Meng
We are working on the pipeline features, which would make this procedure much easier in MLlib. This is still a WIP and the main JIRA is at: https://issues.apache.org/jira/browse/SPARK-1856 Best, Xiangrui On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani chirag.lakh...@gmail.com wrote: Hello, I

Measuring Performance in Spark

2014-10-27 Thread mahsa
Hi, I want to test the performance of MapReduce and Spark on a program, find the bottleneck, calculate the performance of each part of the program, etc. I was wondering if there is a tool for this kind of measurement, like Galia etc., to help me in this regard. Thanks! -- View this message in

Re: Missing java.util.Date class error while running Spark job

2014-10-27 Thread Saket Kumar
After trying a few permutations I was able to get rid of this error by changing the datastax-cassandra-connector version to 1.1.0.alpha. Could it be related to a bug in alpha4? But why would it manifest itself as a class loading error? On 27 Oct 2014, at 12:10, Saket Kumar

Re: Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread Michael Armbrust
No such method error almost always means you are mixing different versions of the same library on the classpath. In this case it looks like you have more than one version of guava. Have you added anything to the classpath? On Mon, Oct 27, 2014 at 8:36 AM, nitinkak001 nitinkak...@gmail.com

Re: Spark as Relational Database

2014-10-27 Thread Michael Armbrust
I'd suggest checking out the Spark SQL programming guide to answer this type of query: http://spark.apache.org/docs/latest/sql-programming-guide.html You could also perform it using the raw Spark RDD API http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.rdd.RDD, but its
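
A minimal sketch of both routes for the two-collection example from the question; the case class, table names and join key are illustrative:

```scala
import org.apache.spark.SparkContext._
import org.apache.spark.sql.SQLContext

case class Record(id: Long, value: String)

// assumes an existing SparkContext `sc`
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

val a = sc.parallelize(Seq(Record(1, "a1"), Record(2, "a2")))
val b = sc.parallelize(Seq(Record(1, "b1"), Record(3, "b3")))

// Spark SQL route
a.registerTempTable("A")
b.registerTempTable("B")
val joinedSql = sqlContext.sql(
  "SELECT A.id, A.value AS aValue, B.value AS bValue FROM A JOIN B ON A.id = B.id")

// Raw RDD route
val joinedRdd = a.map(r => (r.id, r.value)).join(b.map(r => (r.id, r.value)))
```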

Re: Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread Nitin kak
Yes, I added all the Hive jars present in the Cloudera distribution of Hadoop. I added them because I was getting ClassNotFoundException for many required classes (one example stack trace below). So someone in the community suggested including the Hive jars: *Exception in thread main

Re: Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread Michael Armbrust
Which version of CDH are you using? I believe that hive is not correctly packaged in 5.1, but should work in 5.2. Another option that people use is to deploy the plain Apache version of Spark on CDH Yarn. On Mon, Oct 27, 2014 at 11:10 AM, Nitin kak nitinkak...@gmail.com wrote: Yes, I added

Re: Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread Nitin kak
I am now on CDH 5.2 which has the Hive module packaged in it. On Mon, Oct 27, 2014 at 2:17 PM, Michael Armbrust mich...@databricks.com wrote: Which version of CDH are you using? I believe that hive is not correctly packaged in 5.1, but should work in 5.2. Another option that people use is

Java api overhead?

2014-10-27 Thread Sonal Goyal
Hi, I wanted to understand what kind of memory overheads are expected if at all while using the Java API. My application seems to have a lot of live Tuple2 instances and I am hitting a lot of gc so I am wondering if I am doing something fundamentally wrong. Here is what the top of my heap looks

MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-27 Thread Ilya Ganelin
Hello all - I am attempting to run MLLib's ALS algorithm on a substantial test vector - approx. 200 million records. I have resolved a few issues I've had with regards to garbage collection, KryoSeralization, and memory usage. I have not been able to get around this issue I see below however:

Re: Spark as Relational Database

2014-10-27 Thread Peter Wolf
Great. Thank you very much Michael :-D On Mon, Oct 27, 2014 at 2:03 PM, Michael Armbrust mich...@databricks.com wrote: I'd suggest checking out the Spark SQL programming guide to answer this type of query: http://spark.apache.org/docs/latest/sql-programming-guide.html You could also

Re: Python code crashing on ReduceByKey if I return custom class object

2014-10-27 Thread Davies Liu
This is a known issue with PySpark: classes and objects of a custom class defined in the current script cannot be serialized by pickle between driver and worker. You can work around this by putting 'testing' in a module and sending this module to the cluster with `sc.addPyFile`. Davies On Sun, Oct 26, 2014 at 11:57 PM,

using existing hive with spark sql

2014-10-27 Thread Pagliari, Roberto
If I already have Hive running on Hadoop, do I need to build Spark with Hive support using the sbt/sbt -Phive assembly/assembly command? If the answer is no, how do I tell Spark where Hive home is? Thanks,

Spark Shell strange worker Exception

2014-10-27 Thread Paolo Platter
Hi all, I'm submitting a simple task using the spark shell against a cassandraRDD (DataStax environment). I'm getting the following exception from one of the workers: INFO 2014-10-27 14:08:03 akka.event.slf4j.Slf4jLogger: Slf4jLogger started INFO 2014-10-27 14:08:03 Remoting: Starting

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-27 Thread Burak Yavuz
Hi, I've come across this multiple times, but not in a consistent manner. I found it hard to reproduce. I have a jira for it: SPARK-3080 Do you observe this error every single time? Where do you load your data from? Which version of Spark are you running? Figuring out the similarities may

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-27 Thread Ilya Ganelin
Hi Burak. I always see this error. I'm running the CDH 5.2 version of Spark 1.1.0. I load my data from HDFS. By the time it hits the recommender it had gone through many spark operations. On Oct 27, 2014 4:03 PM, Burak Yavuz bya...@stanford.edu wrote: Hi, I've come across this multiple times,

exact count using rdd.count()?

2014-10-27 Thread Josh J
Hi, Is the following guaranteed to always provide an exact count? foreachRDD(foreachFunc = rdd => { rdd.count() The documentation mentions: However, output operations (like foreachRDD) have *at-least once* semantics, that is, the transformed data may get written to an external entity more

Re: exact count using rdd.count()?

2014-10-27 Thread Holden Karau
Hi Josh, The count() call will result in the correct number for each RDD; however, foreachRDD doesn't return the result of its computation anywhere (it's intended for things which cause side effects, like updating an accumulator or making a web request). You might want to look at transform or the
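
A sketch of the accumulator route mentioned above; `ssc` and `stream` stand in for an existing StreamingContext and DStream:

```scala
import org.apache.spark.SparkContext._

// driver-side counter; only the per-batch Long returned by count() is added to it
val totalRecords = ssc.sparkContext.accumulator(0L, "total records")

stream.foreachRDD { rdd =>
  // count() itself runs as a distributed job and returns an exact number
  totalRecords += rdd.count()
}
```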

Upcoming Scala DevFlow Training By the Bay, November 6-7

2014-10-27 Thread Alexy Khrabrov
Spark is written in Scala. As a strongly typed functional programming language, Scala makes Spark possible. The API of Spark mirrors Scala collections API, driving adoption through beautifully intuitive data transformations which work both locally and at scale. And where's more inspiration

Spark to eliminate full-table scan latency

2014-10-27 Thread Ron Ayoub
We have a table containing 25 features per item id along with feature weights. A correlation matrix can be constructed for every feature pair based on co-occurrence. If a user inputs a feature they can find out the features that are correlated with a self-join requiring a single full table

Scala Spark IDE help

2014-10-27 Thread Eric Tanner
I am a Scala / Spark newbie (attending Paco Nathan's class). What I need is some advice on how to set up IntelliJ (or Eclipse) to be able to attach the debugger to the executing process. I know that this is not feasible if the code is executing within the cluster. However, if spark is

Re: Submitting Spark job on cluster from dev environment

2014-10-27 Thread Shailesh Birari
Hello, I am able to submit a job to the Spark cluster from a Windows desktop, but the executors are not able to run. When I check the Spark UI (which is on Windows, as the driver is there), it shows me JAVA_HOME, CLASS_PATH and other environment variables related to Windows. I tried setting

Re: Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread Nitin kak
Somehow it worked by placing all the jars (except guava) from the Hive lib after --jars. I had initially tried placing the jars under another temporary folder and pointing the executor and driver extraClassPath to that directory, but it didn't work. On Mon, Oct 27, 2014 at 2:21 PM, Nitin kak

Re: Spark to eliminate full-table scan latency

2014-10-27 Thread Michael Armbrust
You can access cached data in spark through the JDBC server: http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server On Mon, Oct 27, 2014 at 1:47 PM, Ron Ayoub ronalday...@live.com wrote: We have a table containing 25 features per item id along with

Subquery in having-clause (Spark 1.1.0)

2014-10-27 Thread Daniel Klinger
Hello, I got the following query: SELECT id, Count(*) AS amount FROM table1 GROUP BY id HAVING amount = (SELECT Max(mamount) FROM (SELECT id, Count(*) AS mamount FROM table1

[SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-27 Thread Du Li
Hi, I was trying to set up Spark SQL on a private cluster. I configured a hive-site.xml under spark/conf that uses a local metastore, with the warehouse and default FS name set to HDFS on one of my corporate clusters. Then I started the Spark master, worker and thrift server. However, when creating a

Cant start spark-shell in CDH Spark Standalone 1.1.0+cdh5.2.0+56

2014-10-27 Thread Sanjay Subramanian
hey guys Anyone using CDH Spark Standalone? I installed Spark standalone thru Cloudera Manager $ spark-shell --total-executor-cores 8 /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/bin/../lib/spark/bin/spark-shell: line 44:

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-27 Thread Xiangrui Meng
Could you save the data before ALS and try to reproduce the problem? You might try reducing the number of partitions and not using Kryo serialization, just to narrow down the issue. -Xiangrui On Mon, Oct 27, 2014 at 1:29 PM, Ilya Ganelin ilgan...@gmail.com wrote: Hi Burak. I always see this
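
A sketch of that first step, assuming `ratings` is the RDD[Rating] fed to ALS; the HDFS path and the rank/iteration/lambda values are placeholders:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// persist the exact ALS input so the failure can be reproduced in isolation
ratings.saveAsObjectFile("hdfs:///tmp/als-input")

// later, in a minimal reproduction job (assumes an existing SparkContext `sc`)
val reloaded = sc.objectFile[Rating]("hdfs:///tmp/als-input")
val model = ALS.train(reloaded, 10, 10, 0.01)
```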

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-27 Thread Ganelin, Ilya
Hi Xiangrui - I can certainly save the data before ALS - that would be a great first step. Why would reducing the number of partitions help? I would very much like to understand what's happening internally. Also, with regards to Burak's earlier comment, here is the JIRA referencing this problem.

Re: Cant start spark-shell in CDH Spark Standalone 1.1.0+cdh5.2.0+56

2014-10-27 Thread Sean Owen
It works fine for me on 5.2.0. But I think you are hitting: https://issues.cloudera.org/browse/DISTRO-649 I am not sure what circumstances cause it to not be included, since it works on my cluster, but this is apparently fixed in 5.2.1 for the cases where it is a problem. I assume that follows quite shortly

Re: Subquery in having-clause (Spark 1.1.0)

2014-10-27 Thread Daniel Klinger
So it doesn't matter which dialect I'm using? Because I set spark.sql.dialect to sql. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-tp17401p17408.html Sent from the Apache Spark User List mailing list archive at

Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-27 Thread nitinkak001
Would the RDD resulting from the query below be partitioned on GEO_REGION, GEO_COUNTRY? I ran some tests (using mapPartitions on the resulting RDD) and it seems that there are always 50 partitions generated while there should be around 1000. /SELECT * FROM spark_poc.table_name DISTRIBUTE BY

Batch of updates

2014-10-27 Thread Flavio Pompermaier
Hi to all, I'm trying to convert my old MapReduce job to a Spark one but I have some doubts. My application basically buffers a batch of updates and every 100 elements it flushes the batch to a server. This is very easy in MapReduce but I don't know how you can do that in Scala... For example, if
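
The usual Spark pattern for this kind of batching is a foreachPartition over a grouped iterator; a sketch, where `rdd` and `sendBatch` are placeholders for the poster's data and client call:

```scala
// assumes rdd: RDD[String]; sendBatch would open a connection, push the
// batch to the server and close the connection
def sendBatch(batch: Seq[String]): Unit = { /* client-specific */ }

rdd.foreachPartition { records =>
  // flush to the server every 100 elements within each partition
  records.grouped(100).foreach(batch => sendBatch(batch))
}
```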

Re: Subquery in having-clause (Spark 1.1.0)

2014-10-27 Thread Michael Armbrust
Yeah, sorry for being unclear. Subquery expressions are not supported. That particular error was coming from the Hive parser. On Mon, Oct 27, 2014 at 4:03 PM, Daniel Klinger d...@web-computing.de wrote: So it dosen't matter which dialect im using? Caus i set spark.sql.dialect to sql. --

Meaning of persistence levels -- setting persistence causing out of memory errors with pyspark

2014-10-27 Thread Eric Jonas
I'm running spark locally on my laptop to explore how persistence impacts memory use. I'm generating 80 MB matrices in numpy and then simply adding them as an example problem. No matter what I set NUM or persistence level to in the code below, I get out of memory errors like (

combine rdds?

2014-10-27 Thread Josh J
Hi, How could I combine rdds? I would like to combine two RDDs if the count in an RDD is not above some threshold. Thanks, Josh

Fixed Sized Strings in Spark SQL

2014-10-27 Thread agg212
Hi, I was wondering how to implement fixed sized strings in Spark SQL. I would like to implement TPC-H, which uses fixed sized strings for certain fields (i.e., 15 character L_SHIPMODE field). Is there a way to use a fixed length char array instead of using a string? -- View this message in

Re: How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-27 Thread Localhost shell
Hey lordjoe, Apologies for the late reply. I followed your threadlocal approach and it worked fine. I will update the thread if I get to know more on this. (I don't know how Spark Scala does it, but what I wanted to achieve in Java is quite common in many spark-scala github gists) Thanks. On

Re: combine rdds?

2014-10-27 Thread Koert Kuipers
this requires evaluation of the rdd to do the count. val x: RDD[X] = ... val y: RDD[X] = ... x.cache val z = if (x.count < thres) x.union(y) else x On Oct 27, 2014 7:51 PM, Josh J joshjd...@gmail.com wrote: Hi, How could I combine rdds? I would like to combine two RDDs if the count in an RDD is

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-27 Thread Yana Kadiyska
guoxu1231, I struggled with the Idea problem for a full week. Same thing -- clean builds under MVN/Sbt, but no luck with IDEA. What worked for me was the solution posted higher up in this thread -- it's a SO post that basically says to delete all iml files anywhere under the project directory.

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-27 Thread Soumya Simanta
You need to change the Scala compiler from IntelliJ to “sbt incremental compiler” (see the screenshot below). You can access this by going to “preferences” -> “scala”. NOTE: This is supported only for certain versions of the IntelliJ Scala plugin. See this link for details.

RE: Spark to eliminate full-table scan latency

2014-10-27 Thread Ron Ayoub
This does look like it provides a good way to allow other processes to access the contents of an RDD in a separate app. Is there any other general-purpose mechanism for serving up RDD data? I understand that the driver app and workers are all app-specific and run in separate executors, but would

Re: OutOfMemory in cogroup

2014-10-27 Thread Shixiong Zhu
You can change spark.shuffle.safetyFraction, but that is a really big margin to add. The problem is that the estimated size of used memory is inaccurate. I dug into the code and found that SizeEstimator.visitArray randomly selects 100 cells and uses them to estimate the memory size of the whole

Re: How to avoid use snappy compression when saveAsSequenceFile?

2014-10-27 Thread buring
Here is the error log, abstracted as follows: INFO [binaryTest---main]: before first WARN [org.apache.spark.scheduler.TaskSetManager---Result resolver thread-0]: Lost task 0.0 in stage 0.0 (TID 0, spark-dev136): org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
