Re: JavaStreamingContextFactory checkpoint directory NotSerializableException

2014-11-06 Thread Vasu C
Thanks for pointing to the issue. Yes, I think it's the same issue; below is the exception: ERROR OneForOneStrategy: TestCheckpointStreamingJson$1 java.io.NotSerializableException: TestCheckpointStreamingJson at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at

Re: JavaStreamingContextFactory checkpoint directory NotSerializableException

2014-11-06 Thread Sean Owen
No, not the same thing then. This just means you accidentally have a reference to the unserializable enclosing test class in your code. Just make sure the reference is severed. On Thu, Nov 6, 2014 at 8:00 AM, Vasu C vasuc.bigd...@gmail.com wrote: Thanks for pointing to the issue. Yes I think

CheckPoint Issue with JsonRDD

2014-11-06 Thread Jahagirdar, Madhu
When we enable checkpointing and use JsonRDD we get the following error. Is this a bug? Exception in thread "main" java.lang.NullPointerException at org.apache.spark.rdd.RDD.<init>(RDD.scala:125) at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:103)

Re: why decision trees do binary split?

2014-11-06 Thread jamborta
I meant above that, in the case of categorical variables, it might be more efficient to create a node for each categorical value. Is there a reason why Spark went down the binary route? thanks, -- View this message in context:

Re: why decision trees do binary split?

2014-11-06 Thread Sean Owen
I haven't seen that done before, which may be most of the reason - I am not sure that is common practice. I can see upsides - you need not pick candidate splits to test since there is only one N-way rule possible. The binary split equivalent is N levels instead of 1. The big problem is that you

Re: Snappy temp files not cleaned up

2014-11-06 Thread Marius Soutier
The default value is infinite, so you need to enable it. Personally I’ve set up a couple of cron jobs to clean up /tmp and /var/run/spark. On 06.11.2014, at 08:15, Romi Kuntsman r...@totango.com wrote: Hello, Spark has an internal cleanup mechanism (defined by spark.cleaner.ttl, see
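For illustration, a minimal sketch of enabling the TTL-based cleaner in the driver (the one-hour value is an assumption; spark.cleaner.ttl is given in seconds):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: enable Spark's periodic metadata cleaner with a one-hour TTL.
    // By default the TTL is infinite, i.e. nothing is ever cleaned up.
    val conf = new SparkConf()
      .setAppName("cleaner-ttl-example")    // hypothetical app name
      .set("spark.cleaner.ttl", "3600")     // seconds
    val sc = new SparkContext(conf)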

Re: why decision trees do binary split?

2014-11-06 Thread Carlos J. Gil Bellosta
Hello, There is a big compelling reason for binary splits in general for trees: a split is made if the difference between the two resulting branches is significant. You also want to compare the significance of this candidate split vs all the other candidate splits. There are many statistical tests

Re: why decision trees do binary split?

2014-11-06 Thread Tamas Jambor
Thanks for the reply, Sean. I can see that splitting on all the categories would probably overfit the tree; on the other hand, it might give more insight into the subcategories (it would probably only work if the data is uniformly distributed between the categories). I haven't really found any

Re: I want to make clear the difference about executor-cores number.

2014-11-06 Thread jamborta
the only difference between the two setups (if you only vary the executor cores) is how many tasks are running in parallel (the number of tasks would depend on other factors), so try to inspect the stages while running (probably easier to do that with longer-running tasks) by clicking on one of

Re: JavaStreamingContextFactory checkpoint directory NotSerializableException

2014-11-06 Thread Vasu C
Hi Sean, Below is my Java code, using Spark 1.1.0. I'm still getting the same error. Here the Bean class is serialized. Not sure where exactly the problem is. What am I doing wrong here? public class StreamingJson { public static void main(String[] args) throws Exception { final String HDFS_FILE_LOC

Re: why decision trees do binary split?

2014-11-06 Thread Evan R. Sparks
You can imagine this same logic applying to the continuous case. E.g. what if all the quartiles or deciles of a particular value have different behavior - this could capture that too. Or what if some combination of features was highly discriminative, but only when split into n buckets rather than two... you

Re: JavaStreamingContextFactory checkpoint directory NotSerializableException

2014-11-06 Thread Sean Owen
Erm, you are trying to do all the work in the create() method. This is definitely not what you want to do. It is just supposed to make the JavaStreamingContext. A further problem is that you're using anonymous inner classes, which are non-static and contain a reference to the outer class. The
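For illustration, a minimal Scala sketch of the intended pattern (the checkpoint directory and batch interval are assumptions): all setup happens inside the factory function, which holds no reference to an enclosing non-serializable class and is only asked to build the context.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/checkpoints"   // hypothetical path

    // The factory only constructs and configures the StreamingContext.
    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpointed-streaming-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // define DStream transformations here, using only serializable functions
      ssc
    }

    // Recover from the checkpoint if present, otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()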

Re: how to blend a DStream and a broadcast variable?

2014-11-06 Thread Steve Reinhardt
Excellent. Is there an example of this somewhere? Sent from my iPhone On Nov 6, 2014, at 1:43 AM, Sean Owen so...@cloudera.com wrote: Broadcast vars should work fine in Spark streaming. Broadcast vars are immutable however. If you have some info to cache which might change from batch to
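For illustration, a minimal sketch of using a broadcast variable in a streaming job (the lookup table and the DStream named events are assumptions): the broadcast is created once on the driver and read inside the per-batch closures.

    // Sketch: enrich each record against a small, immutable lookup table.
    val lookup = ssc.sparkContext.broadcast(Map("WARN" -> 1, "ERROR" -> 2))   // hypothetical static data

    val enriched = events.map { line =>
      val level = line.split(" ").headOption.getOrElse("INFO")
      (line, lookup.value.getOrElse(level, 0))
    }
    enriched.print()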

multiple spark context in same driver program

2014-11-06 Thread Paweł Szulc
Hi, quick question: I found this: http://docs.sigmoidanalytics.com/index.php/Problems_and_their_Solutions#Multiple_SparkContext:Failed_to_bind_to:.2F127.0.1.1:45916 My main question: is this constraint still valid? Am I not allowed to have two SparkContexts pointing to the same Spark Master in

RE: Any Replicated RDD in Spark?

2014-11-06 Thread Shuai Zheng
Matei, Thanks for the reply. I don't worry that much about more code because I am migrating from MapReduce, so I have existing code to handle it. But if I want to use a new technology, I will always prefer the right way, not a temporary easy way! I will go with RDD first to test the performance. Thanks! Shuai

Task duration graph on Spark stage UI

2014-11-06 Thread Daniel Darabos
Even though the stage UI has min, 25th percentile, median, 75th percentile, and max durations, I am often still left clueless about the distribution. For example, 100 out of 200 tasks (started at the same time) have completed in 1 hour. How much longer do I have to wait? I cannot guess well based on the five numbers.

Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread tridib
Help please! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-use-HiveContext-in-spark-shell-tp18261p18280.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: spark sql: join sql fails after sqlCtx.cacheTable()

2014-11-06 Thread Tridib Samanta
I am getting an exception in spark-shell at the following line: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) error: bad symbolic reference. A signature in HiveContext.class refers to term hive in package org.apache.hadoop which is not available. It may be completely missing from

Spark and Kafka

2014-11-06 Thread Eduardo Costa Alfaia
Hi guys, I am doing some tests with Spark Streaming and Kafka, but I have seen something strange. I have modified JavaKafkaWordCount to use reduceByKeyAndWindow and to print on screen the accumulated counts of the words. In the beginning Spark works very well: in each iteration the

Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread Jimmy McErlain
Can you be more specific: what version of Spark, Hive, Hadoop, etc.? What are you trying to do? What are the issues you are seeing? J

Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread Terry Siu
What version of Spark are you using? Did you compile your Spark version and if so, what compile options did you use? On 11/6/14, 9:22 AM, tridib tridib.sama...@live.com wrote: Help please! -- View this message in context:

RE: Unable to use HiveContext in spark-shell

2014-11-06 Thread Tridib Samanta
I am using Spark 1.1.0. I built it using: ./make-distribution.sh -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests My ultimate goal is to execute a query on a Parquet file with a nested structure and cast a date string to Date. This is required to calculate the age of a Person

Re: Spark and Kafka

2014-11-06 Thread Eduardo Costa Alfaia
This is my window: reduceByKeyAndWindow( new Function2<Integer, Integer, Integer>() { @Override public Integer call(Integer i1, Integer i2) { return i1 + i2; } }, new Function2<Integer, Integer, Integer>() { public Integer call(Integer i1, Integer i2) {
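For reference, a hedged Scala sketch of the same windowed count with an inverse reduce function (wordPairs stands for a DStream of (word, count) pairs; the window and slide durations are assumptions, and the inverse form requires checkpointing to be enabled):

    import org.apache.spark.streaming.Seconds

    // Sketch: running word counts over a 60-second window, sliding every 10 seconds.
    // The second function subtracts the counts of data that slid out of the window.
    val windowedCounts = wordPairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,
      (a: Int, b: Int) => a - b,
      Seconds(60),
      Seconds(10))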

most efficient way to send data from Scala to python

2014-11-06 Thread jamborta
Hi all, Is there a way in Spark to send data (RDD[Array]) from the Scala component to the Python component? I saw a method that serialises double arrays (serializeDoubleMatrix), but it requires the data to be collected first. I assume that step would pull all the data to the driver. Thanks,

Re: PySpark issue with sortByKey: IndexError: list index out of range

2014-11-06 Thread skane
I don't have any insight into this bug, but on Spark version 1.0.0 I ran into the same bug running the 'sort.py' example. On a smaller data set, it worked fine. On a larger data set I got this error: Traceback (most recent call last): File /home/skane/spark/examples/src/main/python/sort.py,

specifying sort order for sort by value

2014-11-06 Thread SK
Hi, I am using rdd.sortBy(_._2) to get an RDD sorted by value. The default order is ascending order. How can I get it sorted in descending order? I could not find an option to specify the order. I need to get the top K elements of the list sorted in descending order. If there is no option to

SparkSubmitDriverBootstrapper and JVM parameters

2014-11-06 Thread akhandeshi
/usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java org.apache.spark.deploy.SparkSubmitDriverBootstrapper When I execute /usr/local/spark-1.1.0/bin/spark-submit local[32] for my app, I see two processes get spun off. One is the org.apache.spark.deploy.SparkSubmitDriverBootstrapper and

Re: specifying sort order for sort by value

2014-11-06 Thread Akhil Das
Yes, you can sort in descending order: you simply specify a boolean value as the second argument to the sortBy function. The default is ascending. So it will look like: rdd.sortBy(_._2, false) Read more over here http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD Thanks Best Regards On
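For illustration, a short sketch of getting the top K elements by value (k and the RDD name are placeholders):

    // Sketch: descending sort by value, then take the first k elements on the driver.
    val k = 10
    val topK = rdd.sortBy(_._2, ascending = false).take(k)

If k is small, rdd.takeOrdered(k) with a reversed Ordering on the value avoids a full sort.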

Re: specifying sort order for sort by value

2014-11-06 Thread SK
Thanks. I was looking at an older RDD documentation that did not specify the ordering option. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/specifying-sort-order-for-sort-by-value-tp18289p18292.html Sent from the Apache Spark User List mailing list

Kinesis integration with Spark Streaming in EMR cluster - Output is not showing up

2014-11-06 Thread sriks
Hello, I am new to Spark and trying to run a Spark program (bundled as a jar) in an EMR cluster. In one terminal session, I am loading data into a Kinesis stream. In another window, I am trying to run the Spark Streaming program and print out the output. Whenever I run the Spark

Re: loading, querying schemaRDD using SparkSQL

2014-11-06 Thread Michael Armbrust
It can, but currently that method uses the default hive serde which is not very robust (does not deal well with \n in strings) and probably is not super fast. You'll also need to be using a HiveContext for it to work. On Tue, Nov 4, 2014 at 8:20 PM, vdiwakar.malladi vdiwakar.mall...@gmail.com

Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread Terry Siu
Those are the same options I used, except I had --tgz to package it and I built off of the master branch. Unfortunately, my only guess is that these errors stem from your build environment. In your Spark assembly, do you have any classes which belong to the org.apache.hadoop.hive package?

Re: PySpark issue with sortByKey: IndexError: list index out of range

2014-11-06 Thread Davies Liu
It should be fixed in 1.1+. Could you have a script to reproduce it? On Thu, Nov 6, 2014 at 10:39 AM, skane sk...@websense.com wrote: I don't have any insight into this bug, but on Spark version 1.0.0 I ran into the same bug running the 'sort.py' example. On a smaller data set, it worked

Re: AVRO specific records

2014-11-06 Thread Simone Franzini
Benjamin, Thanks for the snippet. I have tried using it, but unfortunately I get the following exception. I am clueless as to what might be wrong. Any ideas? java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected at

Redploying a spark streaming application

2014-11-06 Thread Ashic Mahtab
Hello, I'm trying to find the best way of redeploying a Spark Streaming application. Ideally, I was thinking of a scenario where a build server packages up a jar and a deployment step submits it to a Spark master. On the next successful build, the next version would get deployed, taking down the

Re: Redploying a spark streaming application

2014-11-06 Thread Ganelin, Ilya
You’ve basically got it. The deployment step can simply be scp-ing the file to a known location on the server and then executing a run script on the server that actually runs the spark-submit. From: Ashic Mahtab as...@live.com Date: Thursday, November 6, 2014 at 5:01 PM To:

Re: sparse x sparse matrix multiplication

2014-11-06 Thread Reza Zadeh
See this thread for examples of sparse matrix x sparse matrix: https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA We thought about providing matrix multiplies on CoordinateMatrix, however, the matrices have to be very dense for the overhead of having many little (i, j, value) objects
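For illustration, a hedged sketch of the join-based approach over (i, j, value) entries (entriesA and entriesB are hypothetical RDD[(Long, Long, Double)]):

    // Sketch: C = A * B by joining A's columns against B's rows and summing partial products.
    val aByCol = entriesA.map { case (i, k, v) => (k, (i, v)) }
    val bByRow = entriesB.map { case (k, j, v) => (k, (j, v)) }
    val product = aByCol.join(bByRow)
      .map { case (_, ((i, av), (j, bv))) => ((i, j), av * bv) }
      .reduceByKey(_ + _)          // entries of the product as ((i, j), value)

The join only produces output where both matrices have non-zeros along the shared dimension, which is why density drives the cost of this approach.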

job works well on small data set but fails on large data set

2014-11-06 Thread HARIPRIYA AYYALASOMAYAJULA
Hello all, I am running the following operations: val part1 = maOutput.toArray.flatten val part2 = sc.parallelize(part1) val reduceOutput = part2.combineByKey( (v) => (v, 1), (acc: (Double, Int), v) => (acc._1 + v, acc._2 + 1), (acc1: (Double, Int), acc2:
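The combiner above looks like the usual per-key average pattern; for reference, a complete hedged sketch (pairs stands for an RDD[(K, Double)]):

    // Sketch: compute the mean value per key with combineByKey.
    val sumsAndCounts = pairs.combineByKey(
      (v: Double) => (v, 1),                                              // create a (sum, count) combiner
      (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),        // fold a value into a combiner
      (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2)) // merge combiners across partitions
    val averages = sumsAndCounts.mapValues { case (sum, count) => sum / count }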

Any patterns for multiplexing the streaming data

2014-11-06 Thread bdev
We are looking at consuming the Kafka stream using Spark Streaming and transforming it into various subsets, e.g. applying some transformation or de-normalizing some fields, and feeding it back into Kafka as a different topic for downstream consumers. I wanted to know if there are any existing patterns
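For illustration, a hedged sketch of writing a transformed DStream back to a different Kafka topic from within foreachRDD, using the standard Kafka producer client (the broker list, topic name, and the DStream named transformed are assumptions):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // Sketch: one producer per partition, so connections are not serialized with the closure.
    transformed.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        records.foreach(r => producer.send(new ProducerRecord[String, String]("denormalized-topic", r)))
        producer.close()
      }
    }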

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
I reproduced the problem in the MLlib tests ALSSuite.scala using the following functions: val arrayPredict = userProductsRDD.map { case (user, product) => val recommendedProducts = model.recommendProducts(user, products) val productScore = recommendedProducts.find { x => x.product

Store DStreams into Hive using Hive Streaming

2014-11-06 Thread Luiz Geovani Vier
Hello, Is there a built-in way or connector to store DStream results into an existing Hive ORC table using the Hive/HCatalog Streaming API? Otherwise, do you have any suggestions regarding the implementation of such component? Thank you, -Geovani

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
model.recommendProducts can only be called from the master then? I have a set of 20% of users on whom I am performing the test... the 20% of users are in an RDD... if I have to collect them all to the master node and then call model.recommendProducts, that's an issue... Any idea how to optimize this so that

Re: avro + parquet + vectorstring + NullPointerException while reading

2014-11-06 Thread Michael Albert
Thanks for the advice! What seems to work for me is that I define the array type as: type: { type: array, items: string, java-class: java.util.ArrayList } It seems to be creating an avro.Generic.List, which Spark doesn't know how to serialize, instead of a guava.util.List, which Spark likes.

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Xiangrui Meng
There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-3066 The easiest case is when one side is small. If both sides are large, this is a super-expensive operation. We can do block-wise cross product and then find top-k for each user. Best, Xiangrui On Thu, Nov 6, 2014 at 4:51 PM,
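For illustration, a hedged sketch of the small-side case: if the product factors fit in memory, broadcast them and score the test users' factors in parallel (testUsers, k, and the use of userFeatures/productFeatures as plain factor RDDs are assumptions):

    // Sketch: top-k products per test user without calling recommendProducts on the driver.
    val productFactors = sc.broadcast(model.productFeatures.collect())   // assumes this fits on the driver
    val k = 10
    val topK = model.userFeatures
      .join(testUsers.map(u => (u, ())))                 // keep only the test users
      .map { case (user, (uf, _)) =>
        val scored = productFactors.value.map { case (pid, pf) =>
          (pid, uf.zip(pf).map { case (a, b) => a * b }.sum)   // dot product of the factor vectors
        }
        (user, scored.sortBy(-_._2).take(k))
      }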

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-11-06 Thread Corey Nolet
Michael, Thanks for the explanation. I was able to get this running. On Wed, Oct 29, 2014 at 3:07 PM, Michael Armbrust mich...@databricks.com wrote: We are working on more helpful error messages, but in the meantime let me explain how to read this output.

Re: Store DStreams into Hive using Hive Streaming

2014-11-06 Thread Tathagata Das
Ted, any pointers? On Thu, Nov 6, 2014 at 4:46 PM, Luiz Geovani Vier lgv...@gmail.com wrote: Hello, Is there a built-in way or connector to store DStream results into an existing Hive ORC table using the Hive/HCatalog Streaming API? Otherwise, do you have any suggestions regarding the

Re: Store DStreams into Hive using Hive Streaming

2014-11-06 Thread Silvio Fiorito
Geovani, You can use HiveContext to do inserts into a Hive table in a streaming app just as you would in a batch app. A DStream is really a collection of RDDs, so you can run the insert from within foreachRDD. You just have to be careful that you’re not creating large amounts of small files.
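For illustration, a minimal hedged sketch of that approach on Spark 1.1 (the case class, table name, and stream are assumptions; this issues plain HiveContext inserts, not the HCatalog Streaming API):

    import org.apache.spark.sql.hive.HiveContext

    case class Event(id: String, value: Double)    // hypothetical record type

    val hiveContext = new HiveContext(ssc.sparkContext)
    import hiveContext.createSchemaRDD            // implicit RDD[Event] -> SchemaRDD conversion

    // Sketch: append each micro-batch into an existing Hive table named "events".
    eventStream.foreachRDD { rdd =>
      rdd.insertInto("events")
    }

Coalescing or batching before the insert helps avoid the small-files problem mentioned above.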

Is there a way to limit the sql query result size?

2014-11-06 Thread sagi
Hi spark-users, When I use spark-sql or beeline to query a large dataset, sometimes the query result may cause a driver OOM. So I wonder: is there a config property in Spark SQL to limit the maximum returned result size (without a LIMIT clause in the SQL query)? For example, before the select query, I run

Re: Executor Log Rotation Is Not Working?

2014-11-06 Thread Ji ZHANG
Hi, I figured out that in standalone mode these configurations should be added to the worker process's configs, for example by adding the following line in spark-env.sh: SPARK_WORKER_OPTS=-Dspark.executor.logs.rolling.strategy=time -Dspark.executor.logs.rolling.time.interval=daily
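For completeness, a hedged sketch of a full spark-env.sh line that also caps the number of retained log files (the values are assumptions):

    SPARK_WORKER_OPTS="-Dspark.executor.logs.rolling.strategy=time \
      -Dspark.executor.logs.rolling.time.interval=daily \
      -Dspark.executor.logs.rolling.maxRetainedFiles=7"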

Re: Task size variation while using Range Vs List

2014-11-06 Thread nsareen
Thanks for the response!! Will try to see the behaviour with Cache() -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Task-size-variation-while-using-Range-Vs-List-tp18243p18318.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How to trace/debug serialization?

2014-11-06 Thread Shixiong Zhu
Will this work even with Kryo serialization? Currently spark.closure.serializer must be org.apache.spark.serializer.JavaSerializer, therefore the serialization of closure functions won’t involve Kryo. Kryo is only used to serialize the data. Best Regards, Shixiong Zhu 2014-11-07 12:27
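For illustration, a small hedged sketch of enabling Kryo for data serialization (the app name is a placeholder); closures are still serialized with Java serialization regardless of this setting:

    import org.apache.spark.SparkConf

    // Sketch: Kryo serializes shuffled and cached data; spark.closure.serializer stays JavaSerializer.
    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")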

Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread tridib
Yes, I have the org.apache.hadoop.hive package in the Spark assembly. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-use-HiveContext-in-spark-shell-tp18261p18322.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread tridib
I built Spark 1.1.0 on a fresh machine and the issue is gone! Thank you all for your help. Thanks Regards Tridib -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-use-HiveContext-in-spark-shell-tp18261p18324.html Sent from the Apache Spark User

Re: Nesting RDD

2014-11-06 Thread Holden Karau
Hi Naveen, Nesting RDDs inside of transformations or actions is not supported. Instead, if you need access to the other RDD's contents, you can try doing a join or (if the data is small enough) collecting and broadcasting the second RDD. Cheers, Holden :) On Thu, Nov 6, 2014 at 10:28 PM, Naveen
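For illustration, a hedged sketch of the broadcast approach (smallRdd and bigRdd are hypothetical, and smallRdd is assumed to be a pair RDD that fits in memory):

    // Sketch: collect the small RDD and broadcast it instead of referencing it inside another RDD.
    val smallMap = sc.broadcast(smallRdd.collect().toMap)

    val joined = bigRdd.map { case (key, value) =>
      (key, value, smallMap.value.get(key))   // look up against the broadcast copy
    }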

Parallelize on spark context

2014-11-06 Thread Naveen Kumar Pokala
Hi, JavaRDD<Integer> distData = sc.parallelize(data); On what basis does parallelize split the data into multiple datasets? How do we handle it if we want this many datasets to be executed per executor? For example, my data is a list of 1000 integers and I have a 2-node YARN cluster. It is dividing into
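For illustration, a hedged Scala sketch of controlling the split explicitly through the second argument of parallelize (the partition count of 4 is an assumption):

    // Sketch: ask for an explicit number of partitions instead of relying on spark.default.parallelism.
    val data = (1 to 1000).toList
    val distData = sc.parallelize(data, 4)    // 4 partitions, hence roughly 4 parallel tasks
    println(distData.partitions.length)       // 4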

word2vec: how to save an mllib model and reload it?

2014-11-06 Thread ll
What is the best way to save an MLlib model that you just trained and reload it in the future? Specifically, I'm using the MLlib word2vec model... thanks. -- View this message in context:

Re: word2vec: how to save an mllib model and reload it?

2014-11-06 Thread Evan R. Sparks
Plain old Java serialization is one straightforward approach if you're in Java/Scala. On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote: what is the best way to save an mllib model that you just trained and reload it in the future? specifically, i'm using the mllib word2vec
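For illustration, a hedged sketch of that approach (the path is a placeholder, and model is assumed to be a trained, Java-serializable Word2VecModel):

    import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

    // Sketch: persist the trained model with plain Java serialization.
    val out = new ObjectOutputStream(new FileOutputStream("/tmp/word2vec.model"))
    out.writeObject(model)
    out.close()

    // ...and load it back later.
    val in = new ObjectInputStream(new FileInputStream("/tmp/word2vec.model"))
    val restored = in.readObject().asInstanceOf[org.apache.spark.mllib.feature.Word2VecModel]
    in.close()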

Re: word2vec: how to save an mllib model and reload it?

2014-11-06 Thread Duy Huynh
That works. Is there a better way in Spark? This seems like the most common feature for any machine learning work: to be able to save your model after training it and load it later. On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com wrote: Plain old java serialization is

RE: Parallelize on spark context

2014-11-06 Thread Naveen Kumar Pokala
Hi, In the documentation I found something like this: spark.default.parallelism - Local mode: number of cores on the local machine; Mesos fine-grained mode: 8; Others: total number of cores on all executor nodes or 2, whichever is larger. I am using a 2-node cluster
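For illustration, a hedged sketch of overriding that default explicitly (the value 16 is an assumption):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: set a fixed default parallelism instead of relying on the computed value.
    val conf = new SparkConf()
      .setAppName("parallelism-example")    // hypothetical app name
      .set("spark.default.parallelism", "16")
    val sc = new SparkContext(conf)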

Re: word2vec: how to save an mllib model and reload it?

2014-11-06 Thread Evan R. Sparks
There's some work going on to support PMML - https://issues.apache.org/jira/browse/SPARK-1406 - but it hasn't yet been merged into master. What are you used to doing in other environments? In R I'm used to running save(), same with MATLAB. In Python, either pickling things or dumping to JSON seems

Re: multiple spark context in same driver program

2014-11-06 Thread Akhil Das
Hi Paweł, That doc was created during the initial days (Spark 0.8.0); you can of course create multiple SparkContexts in the same driver program now. Thanks Best Regards On Thu, Nov 6, 2014 at 9:30 PM, Paweł Szulc paul.sz...@gmail.com wrote: Hi, quick question: I found this: