Spark SQL 1.0.0 - RDD from snappy compress avro file

2014-11-28 Thread cjdc
Hi everyone,

I am using Spark 1.0.0 and I am facing some issues with handling binary
snappy-compressed Avro files which I get from HDFS. I know there are
improved mechanisms for handling these files in more recent versions of Spark,
but upgrading is not an option since I am operating on a Cloudera cluster
with no admin privileges.

I would simply like to read some of these Avro files, create the RDD and then
run simple SQL queries against their content.
Following the Spark SQL 1.0.0 Programming Guide, we have:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

val myData = sc.textFile("/example/mydir/MyFile1.avro")

// ### QUESTION ###
// ### How to dynamically define the schema from the Avro header?? ###
//
// val Schema = 

myData.registerAsTable("MyDB")

val query = sql("SELECT * FROM MyDB")
query.collect().foreach(println)

so, how would you modify this to make it work (considering the Spark
version)?
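For reference, one possible direction on 1.0.0 is to read the Avro container
files with the old Hadoop AvroInputFormat and let Spark SQL infer the table
schema from a case class. This is only a sketch (not verified against 1.0.0);
it assumes avro and avro-mapred are on the classpath, and the field names in
MyRecord are made up:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Each Avro record carries its writer schema, so the "header" can be
// inspected from the data itself:
val avroRdd = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable,
  AvroInputFormat[GenericRecord]]("/example/mydir/MyFile1.avro")
// avroRdd.first()._1.datum().getSchema()   // prints the Avro schema at runtime

// Spark SQL 1.0.0 infers table schemas from case classes, so map the generic
// records into one (field names below are hypothetical; adjust to your schema):
case class MyRecord(id: String, value: Long)
val rows = avroRdd.map { case (wrapper, _) =>
  val r = wrapper.datum()
  MyRecord(r.get("id").toString, r.get("value").asInstanceOf[Long])
}
rows.registerAsTable("MyDB")
sql("SELECT * FROM MyDB").collect().foreach(println)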

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-1-0-0-RDD-from-snappy-compress-avro-file-tp19998.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to use FlumeInputDStream in spark cluster?

2014-11-28 Thread Prannoy
Hi,

A BindException comes when two processes are using the same port. In your
Spark configuration just set ("spark.ui.port", "x")
to some other port; x can be any free port, say 12345. The BindException will
not break your job in either case; to fix it, just change the port number.
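For example (a minimal sketch; the port value is arbitrary):

val conf = new org.apache.spark.SparkConf()
  .setAppName("FlumeEventCount")
  .set("spark.ui.port", "12345")  // any free port; avoids the UI BindException
val ssc = new org.apache.spark.streaming.StreamingContext(conf,
  org.apache.spark.streaming.Seconds(30))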

Thanks.

On Fri, Nov 28, 2014 at 1:30 PM, pamtang [via Apache Spark User List] <
ml-node+s1001560n1999...@n3.nabble.com> wrote:

> I'm seeing the same issue on CDH 5.2 with Spark 1.1. FlumeEventCount works
> fine on a Standalone cluster but throw BindException on YARN mode. Is there
> a solution to this problem or FlumeInputDStream will not be working in a
> cluster environment?
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-FlumeInputDStream-in-spark-cluster-tp1604p19997.html
>  To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here
> 
> .
> NAML
> 
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-FlumeInputDStream-in-spark-cluster-tp1604p1.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: ALS failure with size > Integer.MAX_VALUE

2014-11-28 Thread Bharath Ravi Kumar
Any suggestions to address the described problem? In particular, given the
skewed degree of some of the item nodes in the graph, I believe it should be
possible to choose better block sizes to reflect that fact, but I am unsure
how to arrive at those sizes.
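For reference, the number of user/item blocks being varied here is the blocks
argument of ALS.train; a minimal sketch (the values are illustrative only, not
a recommendation for this dataset):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// placeholder data; the real RDD holds the ~1.2B training records
val ratings = sc.parallelize(Seq(Rating(1, 10, 1.0), Rating(2, 10, 2.0)))

val rank = 10
val numIterations = 10
val lambda = 0.01
val numBlocks = 600   // more blocks means smaller shuffle reads per task

val model = ALS.train(ratings, rank, numIterations, lambda, numBlocks)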

Thanks,
Bharath

On Fri, Nov 28, 2014 at 12:00 AM, Bharath Ravi Kumar 
wrote:

> We're training a recommender with ALS in mllib 1.1 against a dataset of
> 150M users and 4.5K items, with the total number of training records being
> 1.2 Billion (~30GB data). The input data is spread across 1200 partitions
> on HDFS. For the training, rank=10, and we've configured {number of user
> data blocks = number of item data blocks}. The number of user/item blocks
> was varied between 50 and 1200. Irrespective of the number of blocks (e.g. at
> 1200 blocks each), there are at least a couple of tasks that end up shuffle
> reading > 9.7G each in the aggregate stage (ALS.scala:337) and failing with
> the following exception:
>
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:108)
> at
> org.apache.spark.storage.DiskStore.getValues(DiskStore.scala:124)
> at
> org.apache.spark.storage.BlockManager.getLocalFromDisk(BlockManager.scala:332)
> at
> org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:204)
>
>
>
>
> As for the data, on the user side, the degree of a node in the
> connectivity graph is relatively small. However, on the item side, 3.8K out
> of the 4.5K items are connected to 10^5 users each on average, with 100
> items being connected to nearly 10^8 users. The rest of the items are
> connected to less than 10^5 users. With such a skew in the connectivity
> graph, I'm unsure if additional memory or variation in the block sizes
> would help (considering my limited understanding of the implementation in
> mllib). Any suggestion to address the problem?
>
>
> The test is being run on a standalone cluster of 3 hosts, each with 100G
> RAM & 24 cores dedicated to the application. The additional configs I made
> specific to the shuffle and task failure reduction are as follows:
>
> spark.core.connection.ack.wait.timeout=600
> spark.shuffle.consolidateFiles=true
> spark.shuffle.manager=SORT
>
>
> The job execution summary is as follows:
>
> Active Stages:
>
> Stage id 2,  aggregate at ALS.scala:337, duration 55 min, Tasks 1197/1200
> (3 failed), Shuffle Read :  141.6 GB
>
> Completed Stages (5)
>
> Stage Id | Description                                      | Duration | Tasks: Succeeded/Total | Input   | Shuffle Read | Shuffle Write
> 6        | org.apache.spark.rdd.RDD.flatMap(RDD.scala:277)  | 12 min   | 1200/1200              | 29.9 GB | 1668.4 MB    | 186.8 GB
> 5        | mapPartitionsWithIndex at ALS.scala:250 +details
> 3        | map at ALS.scala:231
> 0        | aggregate at ALS.scala:337 +details
> 1        | map at ALS.scala:228 +details
>
>
> Thanks,
> Bharath
>


Re: Using Breeze in the Scala Shell

2014-11-28 Thread dean
Debasish Das wrote
> For spark-shell my assumption is spark-shell -cp option should work
> fine

Thanks for the suggestion, but this doesn't work. I tried:

./bin/spark-shell -cp commons-math3-3.2.jar -usejavacp

(apparently -cp is deprecated for the scala shell as of 2.8, so -usejavacp
is required). And then:

scala> import breeze.stats.distributions.Geometric
import breeze.stats.distributions.Geometric

scala> val geom = new Geometric(0.5)
java.lang.NoClassDefFoundError:
org/apache/commons/math3/random/RandomGenerator
...
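If the spark-shell in use forwards options to spark-submit (Spark 1.0+),
passing the jar with --jars instead of -cp puts it on both the driver and
executor classpaths. A sketch, not verified against this exact setup:

// launched with, for example:
//   ./bin/spark-shell --jars commons-math3-3.2.jar
import breeze.stats.distributions.Geometric

val geom = new Geometric(0.5)   // RandomGenerator should now resolve
geom.sample(5)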



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-Breeze-in-the-Scala-Shell-tp19979p20001.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to incrementally compile spark examples using mvn

2014-11-28 Thread MEETHU MATHEW
Hi,

I have a similar problem. I modified the code in mllib and examples, and ran
  mvn install -pl mllib
  mvn install -pl examples
But when I run the program in examples using run-example, the older version of
mllib (before the changes were made) is getting executed. How do I get the
changes made in mllib picked up when calling it from the examples project?

Thanks & Regards,
Meethu M

 On Monday, 24 November 2014 3:33 PM, Yiming (John) Zhang 
 wrote:
   

 Thank you, Marcelo and Sean. "mvn install" is a good answer for my needs.

-Original Message-
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: 21 November 2014 1:47
To: yiming zhang
Cc: Sean Owen; user@spark.apache.org
Subject: Re: How to incrementally compile spark examples using mvn

Hi Yiming,

On Wed, Nov 19, 2014 at 5:35 PM, Yiming (John) Zhang  wrote:
> Thank you for your reply. I was wondering whether there is a method of 
> reusing locally-built components without installing them? That is, if I have 
> successfully built the spark project as a whole, how should I configure it so 
> that I can incrementally build (only) the "spark-examples" sub project 
> without the need of downloading or installation?

As Sean suggests, you shouldn't need to install anything. After "mvn install",
your local repo is a working Spark installation, and you can use spark-submit
and other tools directly within it.

You just need to remember to rebuild the assembly/ project when modifying Spark 
code (or the examples/ project when modifying examples).


--
Marcelo




   

Re: Spark 1.1.1 released but not available on maven repositories

2014-11-28 Thread Luis Ángel Vicente Sánchez
Is there any news about this issue? I have checked Maven Central again and
the artefacts are still not there.

Regards,

Luis

2014-11-27 10:42 GMT+00:00 Luis Ángel Vicente Sánchez <
langel.gro...@gmail.com>:

> I have just read on the website that spark 1.1.1 has been released but
> when I upgraded my project to use 1.1.1 I discovered that the artefacts are
> not on maven yet.
>
> [info] Resolving org.apache.spark#spark-streaming-kafka_2.10;1.1.1 ...
>>
>> [warn] module not found:
>>> org.apache.spark#spark-streaming-kafka_2.10;1.1.1
>>
>> [warn]  local: tried
>>
>> [warn]
>>> /Users/luis.vicente/.ivy2/local/org.apache.spark/spark-streaming-kafka_2.10/1.1.1/ivys/ivy.xml
>>
>> [warn]  public: tried
>>
>> [warn]
>>> https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka_2.10/1.1.1/spark-streaming-kafka_2.10-1.1.1.pom
>>
>> [warn]  sonatype snapshots: tried
>>
>> [warn]
>>> https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-streaming-kafka_2.10/1.1.1/spark-streaming-kafka_2.10-1.1.1.pom
>>
>> [info] Resolving org.apache.spark#spark-core_2.10;1.1.1 ...
>>
>> [warn] module not found: org.apache.spark#spark-core_2.10;1.1.1
>>
>> [warn]  local: tried
>>
>> [warn]
>>> /Users/luis.vicente/.ivy2/local/org.apache.spark/spark-core_2.10/1.1.1/ivys/ivy.xml
>>
>> [warn]  public: tried
>>
>> [warn]
>>> https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.1.1/spark-core_2.10-1.1.1.pom
>>
>> [warn]  sonatype snapshots: tried
>>
>> [warn]
>>> https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-core_2.10/1.1.1/spark-core_2.10-1.1.1.pom
>>
>> [info] Resolving org.apache.spark#spark-streaming_2.10;1.1.1 ...
>>
>> [warn] module not found: org.apache.spark#spark-streaming_2.10;1.1.1
>>
>> [warn]  local: tried
>>
>> [warn]
>>> /Users/luis.vicente/.ivy2/local/org.apache.spark/spark-streaming_2.10/1.1.1/ivys/ivy.xml
>>
>> [warn]  public: tried
>>
>> [warn]
>>> https://repo1.maven.org/maven2/org/apache/spark/spark-streaming_2.10/1.1.1/spark-streaming_2.10-1.1.1.pom
>>
>> [warn]  sonatype snapshots: tried
>>
>> [warn]
>>> https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-streaming_2.10/1.1.1/spark-streaming_2.10-1.1.1.pom
>>>
>>


Re: Status of MLLib exporting models to PMML

2014-11-28 Thread selvinsource
Hi,

Just so you know, I added PMML export for linear models (linear, ridge and
lasso), as suggested by Xiangrui.

I will be looking at SVMs and logistic regression next.

Vincenzo



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLLib-exporting-models-to-PMML-tp18514p20005.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: RDD data checkpoint cleaning

2014-11-28 Thread Luis Ángel Vicente Sánchez
Is there any news about this issue? I was using a local folder on Linux
for checkpointing, "file:///opt/sparkfolders/checkpoints". I think that
being able to use the ReliableKafkaReceiver in a 24x7 system without having
to worry about the disk filling up is a reasonable expectation.
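For context, the setup being described amounts to something like the sketch
below (the app name and batch interval are arbitrary; the ttl and checkpoint
directory are the ones mentioned in this thread):

val conf = new org.apache.spark.SparkConf()
  .setAppName("reliable-kafka-receiver-test")
  .set("spark.cleaner.ttl", "300")   // seconds; receivedData still kept growing
val ssc = new org.apache.spark.streaming.StreamingContext(conf,
  org.apache.spark.streaming.Seconds(10))
ssc.checkpoint("file:///opt/sparkfolders/checkpoints")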

Regards,

Luis

2014-11-21 15:17 GMT+00:00 Luis Ángel Vicente Sánchez <
langel.gro...@gmail.com>:

> I have seen the same behaviour while testing the latest spark 1.2.0
> snapshot.
>
> I'm trying the ReliableKafkaReceiver and it works quite well but the
> checkpoints folder is always increasing in size. The receivedMetaData
> folder remains almost constant in size but the receivedData folder is
> always increasing in size even if I set spark.cleaner.ttl to 300 seconds.
>
> Regards,
>
> Luis
>
> 2014-09-23 22:47 GMT+01:00 RodrigoB :
>
>> Just a follow-up.
>>
>> Just to make sure about the RDDs not being cleaned up, I just replayed the
>> app both on the windows remote laptop and then on the linux machine and at
>> the same time was observing the RDD folders in HDFS.
>>
>> Confirming the observed behavior: running on the laptop I could see the
>> RDDs
>> continuously increasing. When I ran on linux, only two RDD folders were
>> there and continuously being recycled.
>>
>> Metadata checkpoints were being cleaned on both scenarios.
>>
>> tnks,
>> Rod
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-data-checkpoint-cleaning-tp14847p14939.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Spark SQL 1.0.0 - RDD from snappy compress avro file

2014-11-28 Thread cjdc
To make it simpler, for now forget the snappy compression. Just assume they
are binary Avro files...





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-1-0-0-RDD-from-snappy-compress-avro-file-tp19998p20008.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Deadlock between spark logging and wildfly logging

2014-11-28 Thread Charles
We create a Spark context in an application running inside a WildFly
container. When the Spark context is created, we see the following entries in
the WildFly log. After log4j-defaults.properties is loaded, every entry from
Spark is printed out twice. And after running for a while, we start to see a
deadlock between the Spark logging thread and the WildFly logging thread.

Can I control the Spark logging in the driver application? How can I turn
it off in the driver application? How can I control the level of Spark logs
in the driver application?
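One knob that usually works from the driver side is raising the log level
programmatically before the SparkContext is created (a sketch using the
log4j 1.x API; whether it helps under WildFly's logmanager bridge is untested):

import org.apache.log4j.{Level, Logger}

// quiet Spark's (and Akka's) loggers before creating the SparkContext
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("akka").setLevel(Level.WARN)

// Alternatively, put a custom log4j.properties on the driver classpath so
// Spark does not fall back to org/apache/spark/log4j-defaults.properties.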

2014-11-27 14:39:26,719 INFO  [akka.event.slf4j.Slf4jLogger]
(spark-akka.actor.default-dispatcher-4) Slf4jLogger started
2014-11-27 14:39:26,917 INFO  [Remoting]
(spark-akka.actor.default-dispatcher-2) Starting remoting
2014-11-27 14:39:27,719 INFO  [Remoting]
(spark-akka.actor.default-dispatcher-2) Remoting started; listening on
addresses :[akka.tcp://spark@172.32.1.12:43918]
2014-11-27 14:39:27,733 INFO  [Remoting]
(spark-akka.actor.default-dispatcher-2) Remoting now listens on addresses:
[akka.tcp://spark@172.32.1.12:43918]
2014-11-27 14:39:27,892 INFO  [org.apache.spark.SparkEnv] (MSC service
thread 1-16) Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
2014-11-27 14:39:27,895 ERROR [stderr] (MSC service thread 1-16) 14/11/27
14:39:27 INFO SparkEnv: Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
2014-11-27 14:39:27,896 INFO  [org.apache.spark.SparkEnv] (MSC service
thread 1-16) Registering BlockManagerMaster
2014-11-27 14:39:27,896 ERROR [stderr] (MSC service thread 1-16) 14/11/27
14:39:27 INFO SparkEnv: Registering BlockManagerMaster
2014-11-27 14:39:28,041 INFO  [org.apache.spark.storage.DiskBlockManager]
(MSC service thread 1-16) Created local directory at
/tmp/spark-local-20141127143928-d33c
2014-11-27 14:39:28,041 ERROR [stderr] (MSC service thread 1-16) 14/11/27
14:39:28 INFO DiskBlockManager: Created local directory at
/tmp/spark-local-20141127143928-d33c
2014-11-27 14:39:28,055 INFO  [org.apache.spark.storage.MemoryStore] (MSC
service thread 1-16) MemoryStore started with capacity 4.3 GB.
2014-11-27 14:39:28,055 ERROR [stderr] (MSC service thread 1-16) 14/11/27
14:39:28 INFO MemoryStore: MemoryStore started with capacity 4.3 GB.
2014-11-27 14:39:28,117 INFO  [org.apache.spark.network.ConnectionManager]
(MSC service thread 1-16) Bound socket to port 34018 with id =
ConnectionManagerId(ip-172-32-1-12,34018)
2014-11-27 14:39:28,118 ERROR [stderr] (MSC service thread 1-16) 14/11/27
14:39:28 INFO ConnectionManager: Bound socket to port 34018 with id =
ConnectionManagerId(ip-172-32-1-12,34018)
2014-11-27 14:39:28,162 INFO  [org.apache.spark.storage.BlockManagerMaster]
(MSC service thread 1-16) Trying to register BlockManager
2014-11-27 14:39:28,163 ERROR [stderr] (MSC service thread 1-16) 14/11/27
14:39:28 INFO BlockManagerMaster: Trying to register BlockManager
2014-11-27 14:39:28,181 INFO 
[org.apache.spark.storage.BlockManagerMasterActor$BlockManagerInfo]
(spark-akka.actor.default-dispatcher-3) Registering block manager
ip-172-32-1-12:34018 with 4.3 GB RAM
2014-11-27 14:39:28,185 ERROR [stderr]
(spark-akka.actor.default-dispatcher-3) 14/11/27 14:39:28 INFO
BlockManagerMasterActor$BlockManagerInfo: Registering block manager
ip-172-32-1-12:34018 with 4.3 GB RAM



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Deadlock-between-spark-logging-and-wildfly-logging-tp20009.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




optimize multiple filter operations

2014-11-28 Thread mrm
Hi, 

My question is:

I have multiple filter operations where I split my initial rdd into two
different groups. The two groups cover the whole initial set. In code, it's
something like:

set1 = initial.filter(lambda x: x == something)
set2 = initial.filter(lambda x: x != something)

By doing this, I am doing two passes over the data. Is there any way to
optimise this to do it in a single pass?

Note: I was trying to look in the mailing list to see if this question has
been asked already, but could not find it.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/optimize-multiple-filter-operations-tp20010.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Accessing posterior probability of Naive Baye's prediction

2014-11-28 Thread jatinpreet
Thanks Sean, it did turn out to be a simple mistake after all. I appreciate
your help.

Jatin

On Thu, Nov 27, 2014 at 7:52 PM, sowen [via Apache Spark User List] <
ml-node+s1001560n19975...@n3.nabble.com> wrote:

> No, the feature vector is not converted. It contains count n_i of how
> often each term t_i occurs (or a TF-IDF transformation of those). You
> are finding the class c such that P(c) * P(t_1|c)^n_1 * ... is
> maximized.
>
> In log space it's log(P(c)) + n_1*log(P(t_1|c)) + ...
>
> So your n_1 counts (or TF-IDF values) are used as-is and this is where
> the dot product comes from.
>
> Your bug is probably something lower-level and simple. I'd debug the
> Spark example and print exactly its values for the log priors and
> conditional probabilities, and the matrix operations, and yours too,
> and see where the difference is.
>
> On Thu, Nov 27, 2014 at 11:37 AM, jatinpreet <[hidden email]
> > wrote:
>
> > Hi,
> >
> > I have been running through some troubles while converting the code to
> Java.
> > I have done the matrix operations as directed and tried to find the
> maximum
> > score for each category. But the predicted category is mostly different
> from
> > the prediction done by MLlib.
> >
> > I am fetching iterators of the pi, theta and testData to do my
> calculations.
> > pi and theta are in  log space while my testData vector is not, could
> that
> > be a problem because I didn't see explicit conversion in Mllib also?
> >
> > For example, for two categories and 5 features, I am doing the following
> > operation,
> >
> > [1,2] + [1 2 3 4 5  ] * [1,2,3,4,5]
> >         [6 7 8 9 10]
> > These are simple element-wise matrix multiplication and addition
> operators.
>
> -
> To unsubscribe, e-mail: [hidden email]
> 
> For additional commands, e-mail: [hidden email]
> 
>
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Accessing-posterior-probability-of-Naive-Baye-s-prediction-tp19828p19975.html
>  To unsubscribe from Accessing posterior probability of Naive Baye's
> prediction, click here
> 
> .
> NAML
> 
>



-- 
Regards,
Jatinpreet Singh




-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Accessing-posterior-probability-of-Naive-Baye-s-prediction-tp19828p20011.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Deadlock between spark logging and wildfly logging

2014-11-28 Thread Sean Owen
Are you sure it's deadlock? print the thread dump (from kill -QUIT) of
the thread(s) that are deadlocked, I suppose, to show where the issue
is. It seems unlikely that a logging thread would be holding locks
that the app uses.

On Fri, Nov 28, 2014 at 4:01 PM, Charles  wrote:
> We create spark context in an application running inside wildfly container.
> When spark context is created, we see following entires in the wildfly log.
> After the log4j-default.properties is loaded, every entry from spark is
> printed out twice. And after running for a while, we start to see deadlock
> between spark logging thread and wildfly logging thread.
>
>  Can I control the spark logging in the driver application? How can I turn
> it off in the driver application? How can I control the level of spark logs
> in the driver application?
>
> 2014-11-27 14:39:26,719 INFO  [akka.event.slf4j.Slf4jLogger]
> (spark-akka.actor.default-dispatcher-4) Slf4jLogger started
> 2014-11-27 14:39:26,917 INFO  [Remoting]
> (spark-akka.actor.default-dispatcher-2) Starting remoting
> 2014-11-27 14:39:27,719 INFO  [Remoting]
> (spark-akka.actor.default-dispatcher-2) Remoting started; listening on
> addresses :[akka.tcp://spark@172.32.1.12:43918]
> 2014-11-27 14:39:27,733 INFO  [Remoting]
> (spark-akka.actor.default-dispatcher-2) Remoting now listens on addresses:
> [akka.tcp://spark@172.32.1.12:43918]
> 2014-11-27 14:39:27,892 INFO  [org.apache.spark.SparkEnv] (MSC service
> thread 1-16) Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 2014-11-27 14:39:27,895 ERROR [stderr] (MSC service thread 1-16) 14/11/27
> 14:39:27 INFO SparkEnv: Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 2014-11-27 14:39:27,896 INFO  [org.apache.spark.SparkEnv] (MSC service
> thread 1-16) Registering BlockManagerMaster
> 2014-11-27 14:39:27,896 ERROR [stderr] (MSC service thread 1-16) 14/11/27
> 14:39:27 INFO SparkEnv: Registering BlockManagerMaster
> 2014-11-27 14:39:28,041 INFO  [org.apache.spark.storage.DiskBlockManager]
> (MSC service thread 1-16) Created local directory at
> /tmp/spark-local-20141127143928-d33c
> 2014-11-27 14:39:28,041 ERROR [stderr] (MSC service thread 1-16) 14/11/27
> 14:39:28 INFO DiskBlockManager: Created local directory at
> /tmp/spark-local-20141127143928-d33c
> 2014-11-27 14:39:28,055 INFO  [org.apache.spark.storage.MemoryStore] (MSC
> service thread 1-16) MemoryStore started with capacity 4.3 GB.
> 2014-11-27 14:39:28,055 ERROR [stderr] (MSC service thread 1-16) 14/11/27
> 14:39:28 INFO MemoryStore: MemoryStore started with capacity 4.3 GB.
> 2014-11-27 14:39:28,117 INFO  [org.apache.spark.network.ConnectionManager]
> (MSC service thread 1-16) Bound socket to port 34018 with id =
> ConnectionManagerId(ip-172-32-1-12,34018)
> 2014-11-27 14:39:28,118 ERROR [stderr] (MSC service thread 1-16) 14/11/27
> 14:39:28 INFO ConnectionManager: Bound socket to port 34018 with id =
> ConnectionManagerId(ip-172-32-1-12,34018)
> 2014-11-27 14:39:28,162 INFO  [org.apache.spark.storage.BlockManagerMaster]
> (MSC service thread 1-16) Trying to register BlockManager
> 2014-11-27 14:39:28,163 ERROR [stderr] (MSC service thread 1-16) 14/11/27
> 14:39:28 INFO BlockManagerMaster: Trying to register BlockManager
> 2014-11-27 14:39:28,181 INFO
> [org.apache.spark.storage.BlockManagerMasterActor$BlockManagerInfo]
> (spark-akka.actor.default-dispatcher-3) Registering block manager
> ip-172-32-1-12:34018 with 4.3 GB RAM
> 2014-11-27 14:39:28,185 ERROR [stderr]
> (spark-akka.actor.default-dispatcher-3) 14/11/27 14:39:28 INFO
> BlockManagerMasterActor$BlockManagerInfo: Registering block manager
> ip-172-32-1-12:34018 with 4.3 GB RAM
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Deadlock-between-spark-logging-and-wildfly-logging-tp20009.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>




Re: Deadlock between spark logging and wildfly logging

2014-11-28 Thread Charles
Here you go. 

"Result resolver thread-3" - Thread t@35654
   java.lang.Thread.State: BLOCKED
at java.io.PrintStream.flush(PrintStream.java:335)
- waiting to lock <104f7200> (a java.io.PrintStream) owned by
"null_Worker-1" t@1022
at
org.jboss.stdio.StdioContext$DelegatingPrintStream.flush(StdioContext.java:216)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:297)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
- locked <13e0275f> (a java.io.OutputStreamWriter)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:59)
at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:324)
at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
- locked <1af95e38> (a org.apache.log4j.ConsoleAppender)
at
org.apache.log4j.JBossAppenderHandler.doPublish(JBossAppenderHandler.java:42)
at org.jboss.logmanager.ExtHandler.publish(ExtHandler.java:79)
at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:296)
at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:304)
at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:304)
at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:304)
at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:304)
at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:304)
at org.jboss.logmanager.Logger.logRaw(Logger.java:721)
at org.slf4j.impl.Slf4jLogger.log(Slf4jLogger.java:326)
at org.slf4j.impl.Slf4jLogger.log(Slf4jLogger.java:320)
at org.slf4j.impl.Slf4jLogger.info(Slf4jLogger.java:180)
at org.apache.spark.Logging$class.logInfo(Logging.scala:50)


"null_Worker-1" - Thread t@1022
   java.lang.Thread.State: BLOCKED
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:231)
- waiting to lock <1af95e38> (a org.apache.log4j.ConsoleAppender) owned 
by
"Result resolver thread-3" t@35654
at
org.apache.log4j.JBossAppenderHandler.doPublish(JBossAppenderHandler.java:42)
at org.jboss.logmanager.ExtHandler.publish(ExtHandler.java:79)
at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:296)
at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:304)
at org.jboss.logmanager.Logger.logRaw(Logger.java:721)
at org.jboss.logmanager.Logger.log(Logger.java:506)
at
org.jboss.stdio.AbstractLoggingWriter.write(AbstractLoggingWriter.java:71)
- locked <2a11d902> (a java.lang.StringBuilder)
at 
org.jboss.stdio.WriterOutputStream.finish(WriterOutputStream.java:143)
at org.jboss.stdio.WriterOutputStream.flush(WriterOutputStream.java:164)
- locked <2c765985> (a sun.nio.cs.US_ASCII$Decoder)
at java.io.PrintStream.write(PrintStream.java:482)
- locked <104f7200> (a java.io.PrintStream)
at
org.jboss.stdio.StdioContext$DelegatingPrintStream.write(StdioContext.java:264)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
- locked <15f65ea5> (a java.io.OutputStreamWriter)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.PrintWriter.flush(PrintWriter.java:320)
- locked <15f65ea5> (a java.io.OutputStreamWriter)
at clojure.core$flush.invoke(core.clj:3429)
at taoensso.timbre$str_println.doInvoke(timbre.clj:15)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at taoensso.timbre$fn__3179.invoke(timbre.clj:170)
at clojure.core$juxt$fn__4209.invoke(core.clj:2433)
at
taoensso.timbre$wrap_appender_juxt$fn__3244$fn__3248.invoke(timbre.clj:297)
at
taoensso.timbre$wrap_appender_juxt$fn__3229$fn__3231.invoke(timbre.clj:319)
at taoensso.timbre$send_to_appenders_BANG_.doInvoke(timbre.clj:398)
at clojure.lang.RestFn.invoke(RestFn.java:866)
at
cenx.levski.performance_exception_calculation$scheduled_exception_calculation.doInvoke(performance_exception_calculation.clj:207)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Deadlock-between-spark-logging-and-wildfly-logging-tp20009p20013.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Deadlock between spark logging and wildfly logging

2014-11-28 Thread Sean Owen
I'm kind of guessing here, but it looks like the log4j class has
locked its own internal data structure (ConsoleAppender) and then has
a lock on System.err. The JBoss logger looks like it swaps out
System.err (?) with its logging stub, and has that locked, and then
the stub routes to log4j to log, which is locked.

I wonder if you can get JBoss to not do that, or instead log directly
to log4j. I sort of doubt it. I also think the 'embedded Spark' use
case is technically unsupported. Maybe you can configure log4j to not
log anything from JBoss?

On Fri, Nov 28, 2014 at 4:59 PM, Charles  wrote:
> Here you go.
>
> "Result resolver thread-3" - Thread t@35654
>java.lang.Thread.State: BLOCKED
> at java.io.PrintStream.flush(PrintStream.java:335)
> - waiting to lock <104f7200> (a java.io.PrintStream) owned by
> "null_Worker-1" t@1022
> at
> org.jboss.stdio.StdioContext$DelegatingPrintStream.flush(StdioContext.java:216)
> at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:297)
> at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
> - locked <13e0275f> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
> at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:59)
> at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:324)
> at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
> at 
> org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
> - locked <1af95e38> (a org.apache.log4j.ConsoleAppender)
> at
> org.apache.log4j.JBossAppenderHandler.doPublish(JBossAppenderHandler.java:42)
> at org.jboss.logmanager.ExtHandler.publish(ExtHandler.java:79)
> at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:296)
> at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:304)
> at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:304)
> at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:304)
> at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:304)
> at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:304)
> at org.jboss.logmanager.Logger.logRaw(Logger.java:721)
> at org.slf4j.impl.Slf4jLogger.log(Slf4jLogger.java:326)
> at org.slf4j.impl.Slf4jLogger.log(Slf4jLogger.java:320)
> at org.slf4j.impl.Slf4jLogger.info(Slf4jLogger.java:180)
> at org.apache.spark.Logging$class.logInfo(Logging.scala:50)
>
>
> "null_Worker-1" - Thread t@1022
>java.lang.Thread.State: BLOCKED
> at 
> org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:231)
> - waiting to lock <1af95e38> (a org.apache.log4j.ConsoleAppender) 
> owned by
> "Result resolver thread-3" t@35654
> at
> org.apache.log4j.JBossAppenderHandler.doPublish(JBossAppenderHandler.java:42)
> at org.jboss.logmanager.ExtHandler.publish(ExtHandler.java:79)
> at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:296)
> at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:304)
> at org.jboss.logmanager.Logger.logRaw(Logger.java:721)
> at org.jboss.logmanager.Logger.log(Logger.java:506)
> at
> org.jboss.stdio.AbstractLoggingWriter.write(AbstractLoggingWriter.java:71)
> - locked <2a11d902> (a java.lang.StringBuilder)
> at 
> org.jboss.stdio.WriterOutputStream.finish(WriterOutputStream.java:143)
> at 
> org.jboss.stdio.WriterOutputStream.flush(WriterOutputStream.java:164)
> - locked <2c765985> (a sun.nio.cs.US_ASCII$Decoder)
> at java.io.PrintStream.write(PrintStream.java:482)
> - locked <104f7200> (a java.io.PrintStream)
> at
> org.jboss.stdio.StdioContext$DelegatingPrintStream.write(StdioContext.java:264)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
> at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
> - locked <15f65ea5> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
> at java.io.PrintWriter.flush(PrintWriter.java:320)
> - locked <15f65ea5> (a java.io.OutputStreamWriter)
> at clojure.core$flush.invoke(core.clj:3429)
> at taoensso.timbre$str_println.doInvoke(timbre.clj:15)
> at clojure.lang.RestFn.invoke(RestFn.java:408)
> at taoensso.timbre$fn__3179.invoke(timbre.clj:170)
> at clojure.core$juxt$fn__4209.invoke(core.clj:2433)
> at
> taoensso.timbre$wrap_appender_juxt$fn__3244$fn__3248.invoke(timbre.clj:297)
> at
> taoensso.timbre$wrap_appender_juxt$fn__3229$fn__3231.invoke(timbre.clj:319)
> at taoensso.timbre$send_to_appenders_BANG_.doInvoke(timbre.clj:398)
> at

Understanding and optimizing spark disk usage during a job.

2014-11-28 Thread Jaonary Rabarisoa
Dear all,

I have a job that crashes before it finishes because there is no space left
on the device, and I noticed that this job generates a lot of temporary data
on my disk.

To be precise, the job is a simple map job that takes a set of images,
extracts local features and saves these local features as a sequence file.
My images are represented as key-value pairs where the keys are strings
representing the id of the image (the filename) and the values are the
base64 encoding of the images.

To extract the features, I use an external C program that I call with
RDD.pipe. I stream the base64 image to the C program and it sends back the
extracted feature vectors through stdout. Each line represents one feature
vector from the current image. I don't use any serialization library; I
just write the feature vector elements to stdout separated by spaces.
Once in Spark, I just split each line, create a Scala vector from the
values, and save the sequence file.

The overall job looks like the following:

val images: RDD[(String, String)] = ...
val features: RDD[(String, Vector)] = images.pipe(...).map(_.split(" ")...)
features.saveAsSequenceFile(...)

The problem is that for about 3G of image data (about 12000 images) this
job generates more than 180G of temporary data. This seems strange,
since for each image I have about 4000 double feature vectors of dimension
400.

I run the job on my laptop for test purposes, which is why I can't add
additional disk space. In any case, I need to understand why this simple job
generates so much data, and how I can reduce it.
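Not a diagnosis, but two knobs that sometimes shrink intermediate and output
data in a job like this (a sketch; SnappyCodec assumes the native snappy
libraries are available):

import org.apache.hadoop.io.compress.SnappyCodec

// compress the saved sequence file instead of writing it uncompressed
features.saveAsSequenceFile("/path/to/output", Some(classOf[SnappyCodec]))

// Shuffle and spill compression are governed by configuration, e.g.
// spark.shuffle.compress and spark.shuffle.spill.compress (both default to
// true), and temporary files go to spark.local.dir, which can point at a
// larger disk.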


Best,

Jao


Re: Mesos killing Spark Driver

2014-11-28 Thread Gerard Maas
[Ping]
Any hints?

On Thu, Nov 27, 2014 at 3:38 PM, Gerard Maas  wrote:

> Hi,
>
> We are currently running our Spark + Spark Streaming jobs on Mesos,
> submitting our jobs through Marathon.
> We see with some regularity that the Spark Streaming driver gets killed by
> Mesos and then restarted on some other node by Marathon.
>
> I've no clue why Mesos is killing the driver and looking at both the Mesos
> and Spark logs didn't make me any wiser.
>
> On the Spark Streaming driver logs, I find this entry of Mesos "signing
> off" my driver:
>
> Shutting down
> Sending SIGTERM to process tree at pid 17845
> Killing the following process trees:
> [
> -+- 17845 sh -c sh ./run-mesos.sh application-ts.conf
>  \-+- 17846 sh ./run-mesos.sh application-ts.conf
>\--- 17847 java -cp core-compute-job.jar
> -Dconfig.file=application-ts.conf com.compute.job.FooJob 31326
> ]
> Command terminated with signal Terminated (pid: 17845)
>
>
> Has anybody seen something similar? Any hints on where to start digging?
>
> -kr, Gerard.
>
>


Re: Calling spark from a java web application.

2014-11-28 Thread adrian
This may help:
https://github.com/spark-jobserver/spark-jobserver

On Fri, Nov 28, 2014 at 6:59 AM, Jamal [via Apache Spark User List] <
ml-node+s1001560n20007...@n3.nabble.com> wrote:

> Hi,
>
> Any recommendation or tutorial on calling Spark from a Java web application?
>
> Current setup:
> A spring java web application running on Jetty.
>
> Requirement:
> Need to run queries from the web app to spark sql.
>
> Please point me to or recommend the process. I can only find the standalone
> app example on the Spark website.
>
> Obliged
> Jamal
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Calling-spark-from-a-java-web-application-tp20007.html
>  To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here
> 
> .
> NAML
> 
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Calling-spark-from-a-java-web-application-tp20007p20017.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Creating a SchemaRDD from an existing API

2014-11-28 Thread Michael Armbrust
You probably don't need to create a new kind of SchemaRDD.  Instead I'd
suggest taking a look at the data sources API that we are adding in Spark
1.2.  There is not a ton of documentation, but the test cases show how to
implement the various interfaces, and there is an example library for
reading Avro data.
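For a rough idea of the shape of that API in 1.2, here is a sketch only; the
exact interfaces are best taken from the test cases and the Avro example
mentioned above:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, StringType, StructField, StructType}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}

// a hypothetical single-column relation backed by whatever the external API returns
class MyRelation(val sqlContext: SQLContext) extends TableScan {
  override def schema: StructType =
    StructType(StructField("value", StringType, nullable = true) :: Nil)
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("a"), Row("b")))
}

// registered through a provider so it can be referenced from SQL
// (CREATE TEMPORARY TABLE ... USING ...)
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new MyRelation(sqlContext)
}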

On Thu, Nov 27, 2014 at 10:31 PM, Niranda Perera  wrote:

> Hi,
>
> I am evaluating Spark for an analytic component where we do batch
> processing of data using SQL.
>
> So, I am particularly interested in Spark SQL and in creating a SchemaRDD
> from an existing API [1].
>
> This API exposes elements in a database as datasources. Using the methods
> allowed by this data source, we can access and edit data.
>
> So, I want to create a custom SchemaRDD using the methods and provisions of
> this API. I tried going through Spark documentation and the Java Docs, but
> unfortunately, I was unable to come to a final conclusion if this was
> actually possible.
>
> I would like to ask the Spark Devs,
> 1. As of the current Spark release, can we make a custom SchemaRDD?
> 2. What is the extension point to a custom SchemaRDD? or are there
> particular interfaces?
> 3. Could you please point me the specific docs regarding this matter?
>
> Your help in this regard is highly appreciated.
>
> Cheers
>
> [1]
>
> https://github.com/wso2-dev/carbon-analytics/tree/master/components/xanalytics
>
> --
> *Niranda Perera*
> Software Engineer, WSO2 Inc.
> Mobile: +94-71-554-8430
> Twitter: @n1r44 
>


Re: Spark 1.1.1 released but not available on maven repositories

2014-11-28 Thread Andrew Or
Hi Luis,

There seems to be a delay in the 1.1.1 artifacts being pushed to our apache
mirrors. We are working with the infra people to get them up as soon as
possible. Unfortunately, due to the national holiday weekend in the US, this
may take a little longer than expected. For now you may use
https://repository.apache.org/content/repositories/orgapachespark-1043/ as
your Spark 1.1.1 repository instead. This is the one that hosted the final
RC that was later made into the official release.

-Andrew
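In sbt terms, the workaround above amounts to adding that staging repository
as a resolver (a sketch for build.sbt):

resolvers += "Apache Spark 1.1.1 staging" at "https://repository.apache.org/content/repositories/orgapachespark-1043/"

libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.1.1"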



2014-11-28 2:16 GMT-08:00 Luis Ángel Vicente Sánchez <
langel.gro...@gmail.com>:

> Are there any news about this issue? I have checked again maven central
> and the artefacts are still not there.
>
> Regards,
>
> Luis
>
> 2014-11-27 10:42 GMT+00:00 Luis Ángel Vicente Sánchez <
> langel.gro...@gmail.com>:
>
>> I have just read on the website that spark 1.1.1 has been released but
>> when I upgraded my project to use 1.1.1 I discovered that the artefacts are
>> not on maven yet.
>>
>> [info] Resolving org.apache.spark#spark-streaming-kafka_2.10;1.1.1 ...
>>>
>>> [warn] module not found:
 org.apache.spark#spark-streaming-kafka_2.10;1.1.1
>>>
>>> [warn]  local: tried
>>>
>>> [warn]
 /Users/luis.vicente/.ivy2/local/org.apache.spark/spark-streaming-kafka_2.10/1.1.1/ivys/ivy.xml
>>>
>>> [warn]  public: tried
>>>
>>> [warn]
 https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka_2.10/1.1.1/spark-streaming-kafka_2.10-1.1.1.pom
>>>
>>> [warn]  sonatype snapshots: tried
>>>
>>> [warn]
 https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-streaming-kafka_2.10/1.1.1/spark-streaming-kafka_2.10-1.1.1.pom
>>>
>>> [info] Resolving org.apache.spark#spark-core_2.10;1.1.1 ...
>>>
>>> [warn] module not found: org.apache.spark#spark-core_2.10;1.1.1
>>>
>>> [warn]  local: tried
>>>
>>> [warn]
 /Users/luis.vicente/.ivy2/local/org.apache.spark/spark-core_2.10/1.1.1/ivys/ivy.xml
>>>
>>> [warn]  public: tried
>>>
>>> [warn]
 https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.1.1/spark-core_2.10-1.1.1.pom
>>>
>>> [warn]  sonatype snapshots: tried
>>>
>>> [warn]
 https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-core_2.10/1.1.1/spark-core_2.10-1.1.1.pom
>>>
>>> [info] Resolving org.apache.spark#spark-streaming_2.10;1.1.1 ...
>>>
>>> [warn] module not found: org.apache.spark#spark-streaming_2.10;1.1.1
>>>
>>> [warn]  local: tried
>>>
>>> [warn]
 /Users/luis.vicente/.ivy2/local/org.apache.spark/spark-streaming_2.10/1.1.1/ivys/ivy.xml
>>>
>>> [warn]  public: tried
>>>
>>> [warn]
 https://repo1.maven.org/maven2/org/apache/spark/spark-streaming_2.10/1.1.1/spark-streaming_2.10-1.1.1.pom
>>>
>>> [warn]  sonatype snapshots: tried
>>>
>>> [warn]
 https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-streaming_2.10/1.1.1/spark-streaming_2.10-1.1.1.pom

>>>
>


Re: optimize multiple filter operations

2014-11-28 Thread Rishi Yadav
You can try (Scala version; you can convert it to Python):

val set = initial.groupBy(x => if (x == something) "key1" else "key2")

This would do one pass over the original data.
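An alternative that keeps two separate RDDs but still reads the input only
once is to cache the parent before filtering (a sketch; whether it beats the
groupBy depends on the data fitting in cache):

val cached = initial.cache()   // materialized on the first action, then reused
val set1 = cached.filter(x => x == something)
val set2 = cached.filter(x => x != something)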

On Fri, Nov 28, 2014 at 8:21 AM, mrm  wrote:

> Hi,
>
> My question is:
>
> I have multiple filter operations where I split my initial rdd into two
> different groups. The two groups cover the whole initial set. In code, it's
> something like:
>
> set1 = initial.filter(lambda x: x == something)
> set2 = initial.filter(lambda x: x != something)
>
> By doing this, I am doing two passes over the data. Is there any way to
> optimise this to do it in a single pass?
>
> Note: I was trying to look in the mailing list to see if this question has
> been asked already, but could not find it.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/optimize-multiple-filter-operations-tp20010.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>