Spark SQL 1.0.0 - RDD from snappy compress avro file

2014-11-28 Thread cjdc
Hi everyone, I am using Spark 1.0.0 and I am facing some issues with handling binary snappy-compressed Avro files which I get from HDFS. I know there are improved mechanisms to handle these files in more recent versions of Spark, but updating is not an option since I am operating on a Cloudera
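Not the poster's code, but a commonly used way to get an RDD out of Avro container files on Spark 1.0.x is the old hadoopFile API with avro-mapred. A minimal sketch, assuming avro and avro-mapred are on the driver and executor classpaths; the HDFS path and app name are illustrative assumptions:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext

val sc = new SparkContext("local[*]", "avro-read")

// The Avro input format handles the snappy codec transparently, so the same
// code reads compressed and uncompressed container files.
val records = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable,
    AvroInputFormat[GenericRecord]]("hdfs:///data/events/*.avro")
  // GenericRecord is not serializable, so extract what you need (here: toString) right away
  .map { case (wrapper, _) => wrapper.datum().toString }

records.take(5).foreach(println)
```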

Re: How to use FlumeInputDStream in spark cluster?

2014-11-28 Thread Prannoy
Hi, A BindException comes when two processes are using the same port. In your Spark configuration, just set spark.ui.port to some other port, say 12345. The BindException will not break your job in either case; to fix it, just change the port number. Thanks. On Fri, Nov
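A minimal sketch of the suggested workaround; the port number is arbitrary, spark.ui.port only needs to point at a port that is not already in use:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("flume-stream")
  .set("spark.ui.port", "12345") // avoid the BindException on the default UI port (4040)
val sc = new SparkContext(conf)
```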

Re: ALS failure with size Integer.MAX_VALUE

2014-11-28 Thread Bharath Ravi Kumar
Any suggestions to address the described problem? In particular, given the skewed degree of some of the item nodes in the graph, I believe it should be possible to define better block sizes to reflect that fact, but I am unsure how to arrive at those sizes.
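A hedged sketch, not a verified fix: MLlib's ALS lets you set the number of blocks explicitly instead of relying on the default, which is one way to react to a skewed rating distribution. The rank, iteration count and block count below are illustrative assumptions only:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def trainWithMoreBlocks(ratings: RDD[Rating]) =
  new ALS()
    .setRank(20)
    .setIterations(10)
    .setLambda(0.01)
    .setBlocks(200) // more, smaller blocks keep per-block arrays well below Integer.MAX_VALUE elements
    .run(ratings)
```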

Re: Using Breeze in the Scala Shell

2014-11-28 Thread dean
Debasish Das wrote: For spark-shell my assumption is the spark-shell -cp option should work fine. Thanks for the suggestion, but this doesn't work. I tried: ./bin/spark-shell -cp commons-math3-3.2.jar -usejavacp (apparently -cp is deprecated for the Scala shell as of 2.8, so -usejavacp is
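A sketch of the alternative that usually works on Spark 1.0+: pass the jars with --jars (paths and versions below are assumptions) and then import Breeze inside the REPL:

```scala
// Launch the shell with the extra jars on the classpath, e.g.:
//   ./bin/spark-shell --jars /path/to/breeze_2.10-0.7.jar,/path/to/commons-math3-3.2.jar
// Then, inside the REPL, the classes should resolve:
import breeze.linalg.{DenseMatrix, DenseVector}

val v = DenseVector(1.0, 2.0, 3.0)
val m = DenseMatrix.eye[Double](3)
println(m * v)
```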

Re: How to incrementally compile spark examples using mvn

2014-11-28 Thread MEETHU MATHEW
Hi, I have a similar problem. I modified the code in mllib and examples. I did: mvn install -pl mllib; mvn install -pl examples. But when I run the program in examples using run-example, the older version of mllib (before the changes were made) is getting executed. How to get the changes made in mllib

Re: Spark 1.1.1 released but not available on maven repositories

2014-11-28 Thread Luis Ángel Vicente Sánchez
Is there any news about this issue? I have checked Maven Central again and the artefacts are still not there. Regards, Luis 2014-11-27 10:42 GMT+00:00 Luis Ángel Vicente Sánchez langel.gro...@gmail.com: I have just read on the website that Spark 1.1.1 has been released, but when I upgraded

Re: Status of MLLib exporting models to PMML

2014-11-28 Thread selvinsource
Hi, just so you know, I added PMML export for linear models (linear, ridge and lasso) as suggested by Xiangrui. I will be looking at SVMs and logistic regression next. Vincenzo

Re: RDD data checkpoint cleaning

2014-11-28 Thread Luis Ángel Vicente Sánchez
Is there any news about this issue? I was using a local folder in Linux for checkpointing, file:///opt/sparkfolders/checkpoints. I think that being able to use the ReliableKafkaReceiver in a 24x7 system without having to worry about the disk getting full is a reasonable expectation. Regards, Luis
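A minimal sketch of the setup being discussed (the checkpoint path is the one mentioned above). Whether spark.cleaner.ttl actually prunes old checkpoint data on 1.x is exactly the open question in this thread, so treat that setting as an assumption, not a fix:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-24x7")
  .set("spark.cleaner.ttl", "3600") // assumed periodic cleanup interval, in seconds
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("file:///opt/sparkfolders/checkpoints")
```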

Re: Spark SQL 1.0.0 - RDD from snappy compress avro file

2014-11-28 Thread cjdc
To make it simpler, for now forget the snappy compression. Just assume they are binary Avro files...

Deadlock between spark logging and wildfly logging

2014-11-28 Thread Charles
We create a Spark context in an application running inside a WildFly container. When the Spark context is created, we see the following entries in the WildFly log. After log4j-default.properties is loaded, every entry from Spark is printed out twice. And after running for a while, we start to see deadlock

optimize multiple filter operations

2014-11-28 Thread mrm
Hi, My question is: I have multiple filter operations where I split my initial rdd into two different groups. The two groups cover the whole initial set. In code, it's something like: set1 = initial.filter(lambda x: x == something) set2 = initial.filter(lambda x: x != something) By doing

Re: Accessing posterior probability of Naive Baye's prediction

2014-11-28 Thread jatinpreet
Thanks Sean, it did turn out to be a simple mistake after all. I appreciate your help. Jatin On Thu, Nov 27, 2014 at 7:52 PM, sowen [via Apache Spark User List] ml-node+s1001560n19975...@n3.nabble.com wrote: No, the feature vector is not converted. It contains count n_i of how often each

Re: Deadlock between spark logging and wildfly logging

2014-11-28 Thread Sean Owen
Are you sure it's a deadlock? Print the thread dump (from kill -QUIT) of the thread(s) that are deadlocked, I suppose, to show where the issue is. It seems unlikely that a logging thread would be holding locks that the app uses. On Fri, Nov 28, 2014 at 4:01 PM, Charles charles...@cenx.com wrote:

Re: Deadlock between spark logging and wildfly logging

2014-11-28 Thread Charles
Here you go. Result resolver thread-3 - Thread t@35654 java.lang.Thread.State: BLOCKED at java.io.PrintStream.flush(PrintStream.java:335) - waiting to lock 104f7200 (a java.io.PrintStream) owned by null_Worker-1 t@1022 at

Re: Mesos killing Spark Driver

2014-11-28 Thread Gerard Maas
[Ping] Any hints? On Thu, Nov 27, 2014 at 3:38 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi, We are currently running our Spark + Spark Streaming jobs on Mesos, submitting our jobs through Marathon. We see with some regularity that the Spark Streaming driver gets killed by Mesos and

Re: Calling spark from a java web application.

2014-11-28 Thread adrian
This may help: https://github.com/spark-jobserver/spark-jobserver On Fri, Nov 28, 2014 at 6:59 AM, Jamal [via Apache Spark User List] ml-node+s1001560n20007...@n3.nabble.com wrote: Hi, Any recommendation or tutorial on calling spark from java web application. Current setup: A spring java

Re: Creating a SchemaRDD from an existing API

2014-11-28 Thread Michael Armbrust
You probably don't need to create a new kind of SchemaRDD. Instead I'd suggest taking a look at the data sources API that we are adding in Spark 1.2. There is not a ton of documentation, but the test cases show how to implement the various interfaces
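A rough sketch of a custom relation against the data sources API mentioned above. The exact shapes (abstract class vs. trait, package locations) changed between Spark 1.2 and later releases; this follows the later trait-based form and is only meant to show the moving parts: a RelationProvider that builds a BaseRelation, which exposes a schema and a buildScan() returning an RDD[Row].

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// A toy relation that always returns two hard-coded rows.
class DummyRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(StructField("id", IntegerType) :: StructField("name", StringType) :: Nil)

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b")))
}

// Entry point Spark SQL looks up when the data source is referenced by name.
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new DummyRelation(sqlContext)
}
```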

Re: Spark 1.1.1 released but not available on maven repositories

2014-11-28 Thread Andrew Or
Hi Luis, There seems to be a delay in the 1.1.1 artifacts being pushed to our Apache mirrors. We are working with the infra people to get them up as soon as possible. Unfortunately, due to the national holiday weekend in the US, this may take a little longer than expected. For now you may

Re: optimize multiple filter operations

2014-11-28 Thread Rishi Yadav
You can try (Scala version, which you can convert to Python): val set = initial.groupBy(x => if (x == something) key1 else key2). This would do one pass over the original data. On Fri, Nov 28, 2014 at 8:21 AM, mrm ma...@skimlinks.com wrote: Hi, My question is: I have multiple filter operations where I
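A runnable sketch of the suggestion (Scala; the data and the predicate are illustrative). groupBy makes one pass over the input but introduces a shuffle, so whether it beats two filters over a cached RDD depends on the data:

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[*]", "split-example")
val something = 42
val initial = sc.parallelize(1 to 100)

// One pass: assign each element to one of two keys.
val grouped = initial.groupBy(x => if (x == something) "match" else "rest")

// Pull the two groups back out as separate RDDs.
val set1 = grouped.filter { case (key, _) => key == "match" }.flatMap(_._2)
val set2 = grouped.filter { case (key, _) => key == "rest" }.flatMap(_._2)

println(s"${set1.count()} matches, ${set2.count()} others")
```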