Re: spark broadcast unavailable

2014-12-10 Thread 十六夜涙
Hi All, I've read the official Tachyon docs and it does not seem to fit my usage. As I understand it, it just caches files in memory, but I have a file containing over a million lines, about 70 MB; retrieving the data and mapping it into a Map variable takes over several minutes, which I don't want.

Re: Mllib error

2014-12-10 Thread Ritesh Kumar Singh
How did you build your Spark 1.1.1? On Wed, Dec 10, 2014 at 10:41 AM, amin mohebbi aminn_...@yahoo.com.invalid wrote: I'm trying to build a very simple Scala standalone app using MLlib, but I get the following error when trying to build the program: Object mllib is not a member of

Re: KafkaUtils explicit acks

2014-12-10 Thread Mukesh Jha
Hello Guys, Any insights on this? If I'm not clear enough, my question is: how can I use the Kafka consumer and not lose any data in case of failures with spark-streaming? On Tue, Dec 9, 2014 at 2:53 PM, Mukesh Jha me.mukesh@gmail.com wrote: Hello Experts, I'm working on a spark app which

Re: Mllib native netlib-java/OpenBLAS

2014-12-10 Thread Guillaume Pitel
Hi, I had the same problem, and tried to compile with mvn -Pnetlib-lgpl: $ mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package Unfortunately, the resulting assembly jar still lacked the netlib-system class. This command: $ jar tvf

Actor System Corrupted!

2014-12-10 Thread Stephen Samuel (Sam)
Hi all, Having a strange issue that I can't find any previous issues for on the mailing list or stack overflow. Frequently we are getting ACTOR SYSTEM CORRUPTED!! A Dispatcher can't have less than 0 inhabitants! with a stack trace, from akka, in the executor logs, and the executor is marked as

MLLib in Production

2014-12-10 Thread Klausen Schaefersinho
Hi, I would like to use Spark to train a model, but use the model in some other place, e.g. a servlet to do some classification in real time. What is the best way to do this? Can I just copy a model file or something and load it in the servlet? Can anybody point me to a good tutorial?

Re: MLLib in Production

2014-12-10 Thread Simon Chan
Hi Klaus, PredictionIO is an open source product based on Spark MLlib for exactly this purpose. This is the tutorial for classification in particular: http://docs.prediction.io/classification/quickstart/ You can add custom serving logic and retrieve prediction results through REST API/SDKs at

Re: KafkaUtils explicit acks

2014-12-10 Thread francois . garillot
Hi Mukesh, There’s been some great work on Spark Streaming reliability lately. I’m not aware of any doc yet (did I miss something?), but you can look at the ReliableKafkaReceiver’s test suite: — FG On Wed, Dec 10, 2014 at 11:17 AM, Mukesh Jha me.mukesh@gmail.com wrote: Hello

Re: KafkaUtils explicit acks

2014-12-10 Thread francois . garillot
[sorry for the botched half-message] Hi Mukesh, There’s been some great work on Spark Streaming reliability lately. https://www.youtube.com/watch?v=jcJq3ZalXD8 Look at the links from: https://issues.apache.org/jira/browse/SPARK-3129 I’m not aware of any doc yet (did I miss

Maven profile in MLLib netlib-lgpl not working (1.1.1)

2014-12-10 Thread Guillaume Pitel
Hi, Issue created: https://issues.apache.org/jira/browse/SPARK-4816 It is probably a Maven-related question about profiles in child modules. I couldn't find a clean solution, just a workaround: modify pom.xml in the mllib module to force activation of the netlib-lgpl module. Hope a Maven expert will help.

Re: MLLib in Production

2014-12-10 Thread Yanbo Liang
Hi Klaus, There is no ideal method, but there is a workaround. Train the model in a Spark cluster or YARN cluster, then use RDD.saveAsTextFile to store the model (its weights and intercept) in HDFS. Load the weights file and intercept file from HDFS, construct a GLM model, and then run model.predict()
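A minimal sketch of that workaround in spark-shell style Scala (the HDFS paths and the tiny training set are made up for illustration):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}

    val weightsPath   = "hdfs:///models/lr/weights"    // hypothetical locations
    val interceptPath = "hdfs:///models/lr/intercept"

    // Training job: fit the model, then dump weights and intercept as text.
    val training = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(1.0, 2.0)),
      LabeledPoint(2.0, Vectors.dense(2.0, 4.0))))
    val model = LinearRegressionWithSGD.train(training, 100)
    sc.parallelize(Seq(model.weights.toArray.mkString(",")), 1).saveAsTextFile(weightsPath)
    sc.parallelize(Seq(model.intercept.toString), 1).saveAsTextFile(interceptPath)

    // Serving job: reload the two files, rebuild the model, and predict.
    val weights = Vectors.dense(
      sc.textFile(weightsPath).first().split(",").map(_.toDouble))
    val intercept = sc.textFile(interceptPath).first().toDouble
    val restored = new LinearRegressionModel(weights, intercept)
    restored.predict(Vectors.dense(3.0, 6.0))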

flatMap and spilling of output to disk

2014-12-10 Thread Johannes Simon
Hi! I have been using Spark a lot recently and it's been running really well and fast, but now when I increase the data size it's starting to run into problems: I have an RDD in the form of (String, Iterable[String]) - the Iterable[String] was produced by a groupByKey() - and I perform a

Re: flatMap and spilling of output to disk

2014-12-10 Thread Sean Owen
You are rightly thinking that Spark should be able to just stream this massive collection of pairs you are creating, and never need to put it all in memory. That's true, but, your function actually creates a huge collection of pairs in memory before Spark ever touches it. This is going to

Re: flatMap and spilling of output to disk

2014-12-10 Thread Shixiong Zhu
for (v1 <- values; v2 <- values) yield ((v1, v2), 1) will generate all the data at once and return all of it to flatMap. To solve your problem, you should use for (v1 <- values.iterator; v2 <- values.iterator) yield ((v1, v2), 1), which will generate the data only when it's necessary. Best Regards,
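A minimal sketch of the difference, spark-shell style (the toy grouped RDD stands in for the original data):

    // Toy stand-in for an RDD[(String, Iterable[String])] built by groupByKey().
    val grouped = sc.parallelize(Seq(("k", "a"), ("k", "b"), ("k", "c"))).groupByKey()

    // Eager: the plain for-comprehension materializes every (v1, v2) pair
    // in memory before flatMap ever sees the collection.
    val eager = grouped.flatMap { case (_, values) =>
      for (v1 <- values; v2 <- values) yield ((v1, v2), 1)
    }

    // Lazy: iterating over values.iterator yields pairs on demand, so Spark
    // can stream and spill them instead of holding the whole cross product.
    val lazyPairs = grouped.flatMap { case (_, values) =>
      for (v1 <- values.iterator; v2 <- values.iterator) yield ((v1, v2), 1)
    }
    lazyPairs.reduceByKey(_ + _).collect()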

Re: MLLib in Production

2014-12-10 Thread Sonal Goyal
You can also serialize the model and use it in other places. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Wed, Dec 10, 2014 at 5:32 PM, Yanbo Liang yanboha...@gmail.com wrote: Hi Klaus, There is no ideal method but some
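One way to do that, sketched under the assumption that the model's state is just local vectors (true for the MLlib linear models; the file path is made up):

    import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
    import org.apache.spark.mllib.classification.LogisticRegressionModel

    // Write the trained model with plain JVM serialization.
    def saveModel(model: LogisticRegressionModel, path: String): Unit = {
      val out = new ObjectOutputStream(new FileOutputStream(path))
      try out.writeObject(model) finally out.close()
    }

    // Read it back in the other application (e.g. the servlet).
    def loadModel(path: String): LogisticRegressionModel = {
      val in = new ObjectInputStream(new FileInputStream(path))
      try in.readObject().asInstanceOf[LogisticRegressionModel] finally in.close()
    }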

Re: flatMap and spilling of output to disk

2014-12-10 Thread Johannes Simon
Hi! Using an iterator solved the problem! I've been chewing on this for days, so thanks a lot to both of you!! :) Since in an earlier version of my code, I used a self-join to perform the same thing, and ran into the same problems, I just looked at the implementation of PairRDDFunction.join

DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Jaonary Rabarisoa
Dear all, I'm trying to understand the correct use case of ColumnSimilarity implemented in RowMatrix. As far as I know, this function computes the pairwise similarities between the columns of a given matrix. The DIMSUM paper says that it's efficient for large m (rows) and small n (columns). In this case

Re: DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Sean Owen
Well, you're computing similarity of your features then. Whether it is meaningful depends a bit on the nature of your features and more on the similarity algorithm. On Wed, Dec 10, 2014 at 2:53 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, I'm trying to understand what is the

KryoException: Buffer overflow for very small input

2014-12-10 Thread JoeWass
I have narrowed down my problem to some code plus an input file with a single very small input (one line). I'm getting a com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 14634430, but as the input is so small I think there's something else up. I'm not sure what.

how to run spark function in a tomcat servlet

2014-12-10 Thread bai阿蒙
Hi guys: I want to call the RDD API in a servlet which runs in Tomcat. So I added spark-core.jar to the WEB-INF/lib of the web project and deployed it to Tomcat. But spark-core.jar contains the HttpServlet classes belonging to Jetty, so there is a conflict. Can anybody tell me how to

Re: how to run spark function in a tomcat servlet

2014-12-10 Thread Alonso Isidoro Roman
Hi, I think this post http://stackoverflow.com/questions/2681759/is-there-anyway-to-exclude-artifacts-inherited-from-a-parent-pom will help you. Alonso Isidoro Roman. My favourite quotes (of today): If debugging is the process of removing software errors, then programming must be the

Re: DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Debasish Das
If you have a tall x skinny matrix of m users and n products, column similarity will give you an n x n matrix (a product x product matrix)...this is also called the product correlation matrix...it can be cosine, Pearson or other kinds of correlations...Note that if the entry is unobserved (user Joanary did
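A small sketch of that shape in Scala (rows are users, columns are products; the threshold overload is the DIMSUM sampling variant):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Tall-and-skinny matrix: m user rows, n product columns.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 2.0),
      Vectors.dense(0.0, 3.0, 4.0),
      Vectors.dense(5.0, 6.0, 0.0)))
    val mat = new RowMatrix(rows)

    val exact  = mat.columnSimilarities()     // n x n cosine similarities, exact
    val approx = mat.columnSimilarities(0.1)  // DIMSUM sampling, for large inputs
    exact.entries.collect()                   // CoordinateMatrix of (i, j, similarity)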

Spark 1.1.0 does not spawn more than 6 executors in yarn-client mode and ignores --num-executors

2014-12-10 Thread Aniket Bhatnagar
I am running Spark 1.1.0 on AWS EMR and I am running a batch job that seems to be highly parallelizable in yarn-client mode. But Spark stops spawning any more executors after spawning 6 executors, even though the YARN cluster has 15 healthy m1.large nodes. I even tried providing '--num-executors

Re: PhysicalRDD problem?

2014-12-10 Thread Michael Armbrust
I'm hesitant to merge that PR in as it is using a brand new configuration path that is different from the way that the rest of Spark SQL / Spark are configured. I'm suspicious that that hitting max iterations is emblematic of some other issue, as typically resolution happens bottom up, in a

Issue when upgrading from Spark 1.1.0 to 1.1.1: Exception of java.lang.NoClassDefFoundError: io/netty/util/TimerTask

2014-12-10 Thread S. Zhou
Everything worked fine on Spark 1.1.0 until we upgraded to 1.1.1. For some of our unit tests we saw the following exceptions. Any idea how to solve it? Thanks! java.lang.NoClassDefFoundError: io/netty/util/TimerTask at org.apache.spark.storage.BlockManager.init(BlockManager.scala:72)

Re: DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Reza Zadeh
As Sean mentioned, you would be computing similar features then. If you want to find similar users, I suggest running k-means with some fixed number of clusters. It's not reasonable to try and compute all pairs of similarities between 1bn items, so k-means with fixed k is more suitable here.
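A minimal k-means sketch along those lines (toy user feature vectors, fixed k):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Cluster user feature vectors with a fixed k instead of all-pairs similarity.
    val users = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0),
      Vectors.dense(9.0, 8.5), Vectors.dense(8.8, 9.1)))
    val model = KMeans.train(users, 2 /* k */, 20 /* maxIterations */)

    // Users that land in the same cluster are treated as similar.
    val assignments = users.map(v => (model.predict(v), v))
    assignments.collect()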

Spark 1.1.1 SQLContext.jsonFile dumps trace if JSON has newlines ...

2014-12-10 Thread Manoj Samel
I am using SQLContext.jsonFile. If a valid JSON contains newlines, Spark 1.1.1 dumps the trace below. If the JSON is read as one line, it works fine. Is this known? 14/12/10 11:44:02 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 28) com.fasterxml.jackson.core.JsonParseException:

Spark 1.0.0 Standalone mode config

2014-12-10 Thread 9000revs
I am using CDH5.1 and Spark 1.0.0. I am trying to configure the resources allocated to each application. How do I do this? For example, I would like each app to use 2 cores and 8G of RAM. I have tried using the pyspark command-line parameters --driver-memory, --driver-cores and see no effect of those

Re: Cluster getting a null pointer error

2014-12-10 Thread Yana Kadiyska
does spark-submit with SparkPi and spark-examples.jar work? e.g. ./spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://xx.xx.xx.xx:7077 /path/to/examples.jar On Tue, Dec 9, 2014 at 6:58 PM, Eric Tanner eric.tan...@justenough.com wrote: I have set up a cluster

Re: Spark 1.1.1 SQLContext.jsonFile dumps trace if JSON has newlines ...

2014-12-10 Thread Michael Armbrust
Yep. Because sc.textFile only guarantees that lines will be preserved across splits, this is the semantic. It would be possible to write a custom input format, but that hasn't been done yet. From the documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
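Until such an input format exists, one workaround (assuming each file holds exactly one multi-line JSON record; the path is hypothetical) is to flatten the records yourself and hand them to jsonRDD:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Read whole files, strip the newlines, and feed one record per string to jsonRDD.
    val oneRecordPerString = sc.wholeTextFiles("hdfs:///data/multiline-json/")
      .map { case (_, content) => content.replace("\n", " ") }
    val schemaRDD = sqlContext.jsonRDD(oneRecordPerString)
    schemaRDD.printSchema()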

MLlib: Libsvm: Loss was due to java.lang.ArrayIndexOutOfBoundsException

2014-12-10 Thread Sameer Tilak
Hi All, When I run LinearRegressionWithSGD, I get the following error. Any help on how to debug this further will be highly appreciated. 14/12/10 20:26:02 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException java.lang.ArrayIndexOutOfBoundsException: 150323 at

Key not valid / already cancelled using Spark Streaming

2014-12-10 Thread Flávio Santos
Dear Spark'ers, I'm trying to run a simple job using Spark Streaming (version 1.1.1) and YARN (yarn-cluster), but unfortunately I'm facing many issues. In short, my job does the following: - Consumes a specific Kafka topic - Writes its content to S3 or HDFS Records in Kafka are in the form:

Trouble with cache() and parquet

2014-12-10 Thread Yana Kadiyska
Hi folks, wondering if anyone has thoughts. Trying to create something akin to a materialized view (sqlContext is a HiveContext connected to an external metastore): val last2HourRdd = sqlContext.sql(s"select * from mytable") //last2HourRdd.first prints out a org.apache.spark.sql.Row = [...] with

Re: Trouble with cache() and parquet

2014-12-10 Thread Michael Armbrust
Have you checked to make sure the schema in the metastore matches the schema in the parquet file? One way to test would be to just use sqlContext.parquetFile(...) which infers the schema from the file instead of using the metastore. On Wed, Dec 10, 2014 at 12:46 PM, Yana Kadiyska
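A quick way to compare the two, sketched with a hypothetical table name and warehouse path (sqlContext being the HiveContext from the thread):

    // Schema the Hive metastore thinks the table has:
    val fromMetastore = sqlContext.sql("SELECT * FROM mytable LIMIT 0")
    fromMetastore.printSchema()

    // Schema the Parquet files themselves report:
    val fromFiles = sqlContext.parquetFile("hdfs:///user/hive/warehouse/mytable")
    fromFiles.printSchema()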

Re: MLLib in Production

2014-12-10 Thread Ganelin, Ilya
Hi all – I’ve been storing the model's userFeatures and productFeatures vectors (which are generated internally) serialized on disk, and importing them in a separate job. From: Sonal Goyal sonalgoy...@gmail.com Date: Wednesday, December 10, 2014 at 5:31 AM To: Yanbo Liang
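A sketch of that approach (toy ratings and hypothetical HDFS paths; scoring a user/product pair is just the dot product of its two factor vectors):

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Training job: persist the factor RDDs learned by ALS.
    val ratings = sc.parallelize(Seq(Rating(1, 10, 4.0), Rating(2, 10, 3.0), Rating(2, 20, 5.0)))
    val model = ALS.train(ratings, rank = 5, iterations = 10)
    model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
    model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

    // Importing job: reload the factors and score user 1 against product 10.
    val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
    val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")
    val u = userFeatures.lookup(1).head
    val p = productFeatures.lookup(10).head
    val score = u.zip(p).map { case (a, b) => a * b }.sum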

Re: MLLib /ALS : java.lang.OutOfMemoryError: Java heap space

2014-12-10 Thread happyyxw
How many worker nodes are these 100 executors located on? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-ALS-java-lang-OutOfMemoryError-Java-heap-space-tp20584p20610.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Filtering nested data using Spark SQL

2014-12-10 Thread Jerry Lam
Hi spark users, I'm trying to filter a json file that has the following schema using Spark SQL:
root
 |-- user_id: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- item_id: string (nullable = true)
 |    |    |-- name:

Unread block issue w/ spark 1.1.0 on CDH5

2014-12-10 Thread Anson Abraham
I recently installed Spark standalone through Cloudera Manager on my CDH 5.2 cluster. CDH 5.2 is running on CentOS release 6.6. The version of Spark through Cloudera is 1.1, standalone. I have a file in HDFS at /tmp/testfile.txt. So what I do is run spark-shell: scala> val source

Using Intellij for pyspark

2014-12-10 Thread Stephen Boesch
Anyone have luck with this? An issue encountered is handling multiple languages - python, java, scala - within one module: it is unclear how to select two module SDKs. Both Python and Scala facets were added to the spark-parent module. But when the Project level SDK is not set to Python then the

RDDs being cleaned too fast

2014-12-10 Thread ankits
I'm using Spark 1.1.0 and am seeing persisted RDDs being cleaned up too fast. How can I inspect the size of an RDD in memory and get more information about why it was cleaned up? There should be more than enough memory available on the cluster to store them, and by default, the spark.cleaner.ttl is

Partitioner in sortBy

2014-12-10 Thread Kevin Jung
Hi, I'm wondering if I can change the RangePartitioner in sortBy to another partitioner like HashPartitioner. The first thing that comes to mind is that it cannot be replaced, because RangePartitioner is part of the sort algorithm. If we call mapPartitions on key-based partitions after sorting, we

Regarding Classification of Big Data

2014-12-10 Thread Chintan Bhatt
Hi, how can I do classification of big data using Spark? Which machine learning algorithm is preferable for that? -- CHINTAN BHATT http://in.linkedin.com/pub/chintan-bhatt/22/b31/336/ Assistant Professor, U P U Patel Department of Computer Engineering, Chandubhai S. Patel Institute of Technology,

Re: Spark 1.0.0 Standalone mode config

2014-12-10 Thread Marcelo Vanzin
Hello, What do you mean by an app that uses 2 cores and 8G of RAM? Spark apps generally involve multiple processes. The command-line options you used affect only one of them (the driver). You may want to take a look at the similar configuration for executors. Also, check the documentation:
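For example, a sketch of the executor-side settings for standalone mode (the values echo the question; the app name is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    // Cap the app at 2 cores total and give each executor 8g of memory.
    // These are executor-side settings, separate from --driver-memory/--driver-cores.
    val conf = new SparkConf()
      .setAppName("MyApp")
      .set("spark.cores.max", "2")
      .set("spark.executor.memory", "8g")
    val sc = new SparkContext(conf)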

Re: flatMap and spilling of output to disk

2014-12-10 Thread Shixiong Zhu
Good catch. `Join` should use `Iterator`, too. I opened a JIRA here: https://issues.apache.org/jira/browse/SPARK-4824 Best Regards, Shixiong Zhu 2014-12-10 21:35 GMT+08:00 Johannes Simon johannes.si...@mail.de: Hi! Using an iterator solved the problem! I've been chewing on this for days, so

Re: Error outputing to CSV file

2014-12-10 Thread manasdebashiskar
The saveAsSequenceFile method works on an RDD; your object csv is a String. If you are using spark-shell you can type your object to learn its datatype. Some prefer Eclipse (and its IntelliSense) to make their lives easier. ..Manas - Manas Kar -- View this message in context:

Re: equivalent to sql in

2014-12-10 Thread manasdebashiskar
If you want to take out apple and orange you might want to try dataRDD.filter(_._2 != "apple").filter(_._2 != "orange") and so on. ...Manas - Manas Kar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/equivalent-to-sql-in-tp20599p20616.html Sent from the
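A sketch of how a single filter with a Set plays the role of SQL's NOT IN (the toy RDD is made up):

    val dataRDD = sc.parallelize(Seq((1, "apple"), (2, "orange"), (3, "pear")))
    // Values to exclude, as in SQL's ... NOT IN ('apple', 'orange')
    val excluded = Set("apple", "orange")
    val filtered = dataRDD.filter { case (_, fruit) => !excluded.contains(fruit) }
    filtered.collect()  // Array((3,pear))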

Re: Saving Data only if Dstream is not empty

2014-12-10 Thread manasdebashiskar
Can you use countApprox as a condition to check for a non-empty RDD? ..Manas - Manas Kar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Saving-Data-only-if-Dstream-is-not-empty-tp20587p20617.html Sent from the Apache Spark User List mailing list archive
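A sketch of that idea, using a cheap take(1) emptiness check in place of a full count (the DStream type and output prefix are assumptions):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // Save each batch only if it has at least one element; take(1) is a cheap
    // emptiness test (rdd.countApprox(...) would be the approximate alternative).
    def saveNonEmpty(stream: DStream[String], outputPrefix: String): Unit = {
      stream.foreachRDD { (rdd: RDD[String], time) =>
        if (rdd.take(1).nonEmpty) {
          rdd.saveAsTextFile(s"$outputPrefix-${time.milliseconds}")
        }
      }
    }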

Decision Tree with libsvmtools datasets

2014-12-10 Thread Ge, Yao (Y.)
I am testing decision trees using the iris.scale data set (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#iris). In the data set there are three class labels: 1, 2, and 3. However, in the following code I have to set numClasses = 4. I will get an ArrayIndexOutOfBoundsException
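One likely explanation is that MLlib expects class labels in {0, ..., numClasses - 1}; a hedged sketch that shifts the libsvm labels {1, 2, 3} down by one so numClasses = 3 works (the path is hypothetical):

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.util.MLUtils

    // Load the libsvm data and remap labels 1/2/3 to 0/1/2 before training.
    val raw = MLUtils.loadLibSVMFile(sc, "hdfs:///data/iris.scale")
    val data = raw.map(lp => lp.copy(label = lp.label - 1))
    val model = DecisionTree.trainClassifier(
      data, numClasses = 3, categoricalFeaturesInfo = Map[Int, Int](),
      impurity = "gini", maxDepth = 5, maxBins = 32)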

Decision Tree with Categorical Features

2014-12-10 Thread Ge, Yao (Y.)
Can anyone provide example code of using categorical features in a decision tree? Thanks! -Yao
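A minimal sketch of passing categoricalFeaturesInfo (a map of feature index to number of categories); the data and arities are made up:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree

    // Feature 0 is categorical with 3 categories (encoded 0.0, 1.0, 2.0);
    // feature 1 is continuous and needs no entry in the map.
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.5)),
      LabeledPoint(1.0, Vectors.dense(1.0, 3.2)),
      LabeledPoint(1.0, Vectors.dense(2.0, 2.7))))
    val model = DecisionTree.trainClassifier(
      data, numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3),
      impurity = "gini", maxDepth = 4, maxBins = 32)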

Compare performance of sqlContext.jsonFile and sqlContext.jsonRDD

2014-12-10 Thread Rakesh Nair
Couple of questions: 1. sqlContext.jsonFile reads a JSON file, infers the schema for the stored data, and then returns a SchemaRDD. Now, I could also create a SchemaRDD by reading a file as text (which returns RDD[String]) and then using the jsonRDD method. My question: is the jsonFile way of

parquet file not loading (spark v 1.1.0)

2014-12-10 Thread Rahul Bindlish
Hi, I have created a parquet file from a case class using saveAsParquetFile. Then I try to reload it using parquetFile, but it fails. Sample code is attached. Any help would be appreciated. Regards, Rahul rahul@... sample_parquet.sample_parquet

Re: RDDs being cleaned too fast

2014-12-10 Thread Aaron Davidson
The ContextCleaner uncaches RDDs that have gone out of scope on the driver. So it's possible that the given RDD is no longer reachable in your program's control flow, or else it'd be a bug in the ContextCleaner. On Wed, Dec 10, 2014 at 5:34 PM, ankits ankitso...@gmail.com wrote: I'm using spark

Re: Issue when upgrading from Spark 1.1.0 to 1.1.1: Exception of java.lang.NoClassDefFoundError: io/netty/util/TimerTask

2014-12-10 Thread Akhil Das
You could try adding the io.netty jars http://mvnrepository.com/artifact/io.netty/netty-all/4.0.23.Final to the classpath. It looks like that jar is missing. Thanks Best Regards On Thu, Dec 11, 2014 at 12:15 AM, S. Zhou myx...@yahoo.com.invalid wrote: Everything worked fine on Spark 1.1.0 until we
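If the project is built with sbt, a hedged sketch of the dependency line (the version is taken from the Maven link above):

    // build.sbt
    libraryDependencies += "io.netty" % "netty-all" % "4.0.23.Final"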

Re: PySprak and UnsupportedOperationException

2014-12-10 Thread Mohamed Lrhazi
Thanks Davies. It turns out it was indeed, and they fixed it in last night's nightly build! https://github.com/elasticsearch/elasticsearch-hadoop/issues/338 On Wed, Dec 10, 2014 at 2:52 AM, Davies Liu dav...@databricks.com wrote: On Tue, Dec 9, 2014 at 11:32 AM, Mohamed Lrhazi

RE: Spark-SQL JDBC driver

2014-12-10 Thread Judy Nash
Looks like you are wondering why you cannot see the RDD table you have created via Thrift? Based on my own experience with Spark 1.1, an RDD created directly via Spark SQL (i.e. Spark Shell or Spark-SQL.sh) is not visible to Thrift, since Thrift has its own session containing its own RDDs. Spark

Error on JavaSparkContext.stop()

2014-12-10 Thread Taeyun Kim
Hi, When my spark program calls JavaSparkContext.stop(), the following errors occur. 14/12/11 16:24:19 INFO Main: sc.stop { 14/12/11 16:24:20 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(cluster02,38918) not found