Re: Compile Spark with Maven & Zinc Scala Plugin

2015-03-06 Thread Night Wolf
Tried that. No luck. Same error on the sbt-interface jar. I can see that Maven
downloaded that jar into my .m2 cache.

On Friday, March 6, 2015, 鹰 <980548...@qq.com> wrote:

> try it with mvn  -DskipTests -Pscala-2.11 clean install package


Store the shuffled files in memory using Tachyon

2015-03-06 Thread sara mustafa
Hi all,

Is it possible to store Spark shuffled files on a distributed memory like
Tachyon instead of spilling them to disk?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Store-the-shuffled-files-in-memory-using-Tachyon-tp21944.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-stream programme failed on yarn-client

2015-03-06 Thread Akhil Das
Looks like an issue with your yarn setup, could you try doing a simple
example with spark-shell?

Start the spark shell as:

$ MASTER=yarn-client bin/spark-shell
spark-shell> sc.parallelize(1 to 1000).collect

If that doesn't work, then make sure your YARN services are up and running,
and in your spark-env.sh you may set the corresponding configurations from
the following:


# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512
Mb)


Thanks
Best Regards

On Fri, Mar 6, 2015 at 1:09 PM, fenghaixiong <980548...@qq.com> wrote:

> Hi all,
>
>
> I'm trying to write a Spark Streaming program, so I read the Spark
> online documentation. Following the document, I wrote the program like this:
>
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming._
> import org.apache.spark.streaming.StreamingContext._
>
> object SparkStreamTest {
>   def main(args: Array[String]) {
> val conf = new SparkConf()
> val ssc = new StreamingContext(conf, Seconds(1))
> val lines = ssc.socketTextStream(args(0), args(1).toInt)
> val words = lines.flatMap(_.split(" "))
> val pairs = words.map(word => (word, 1))
> val wordCounts = pairs.reduceByKey(_ + _)
> wordCounts.print()
> ssc.start() // Start the computation
> ssc.awaitTermination()  // Wait for the computation to terminate
>   }
>
> }
>
>
>
> For testing, I first start listening on a port with:
>  nc -lk 
>
> and then I submit the job with:
> spark-submit  --master local[2] --class com.nd.hxf.SparkStreamTest
> spark-test-tream-1.0-SNAPSHOT-job.jar  localhost 
>
> everything is okay
>
>
> But when I run it on YARN like this:
> spark-submit  --master yarn-client --class com.nd.hxf.SparkStreamTest
> spark-test-tream-1.0-SNAPSHOT-job.jar  localhost 
>
> it waits for a long time and repeatedly outputs some messages; part of the output
> is like this:
>
>
>
>
>
>
>
> 15/03/06 15:30:24 INFO YarnClientSchedulerBackend: SchedulerBackend is
> ready for scheduling beginning after waiting
> maxRegisteredResourcesWaitingTime: 3(ms)
> 15/03/06 15:30:24 INFO ReceiverTracker: ReceiverTracker started
> 15/03/06 15:30:24 INFO ForEachDStream: metadataCleanupDelay = -1
> 15/03/06 15:30:24 INFO ShuffledDStream: metadataCleanupDelay = -1
> 15/03/06 15:30:24 INFO MappedDStream: metadataCleanupDelay = -1
> 15/03/06 15:30:24 INFO FlatMappedDStream: metadataCleanupDelay = -1
> 15/03/06 15:30:24 INFO SocketInputDStream: metadataCleanupDelay = -1
> 15/03/06 15:30:24 INFO SocketInputDStream: Slide time = 1000 ms
> 15/03/06 15:30:24 INFO SocketInputDStream: Storage level =
> StorageLevel(false, false, false, false, 1)
> 15/03/06 15:30:24 INFO SocketInputDStream: Checkpoint interval = null
> 15/03/06 15:30:24 INFO SocketInputDStream: Remember duration = 1000 ms
> 15/03/06 15:30:24 INFO SocketInputDStream: Initialized and validated
> org.apache.spark.streaming.dstream.SocketInputDStream@b01c5f8
> 15/03/06 15:30:24 INFO FlatMappedDStream: Slide time = 1000 ms
> 15/03/06 15:30:24 INFO FlatMappedDStream: Storage level =
> StorageLevel(false, false, false, false, 1)
> 15/03/06 15:30:24 INFO FlatMappedDStream: Checkpoint interval = null
> 15/03/06 15:30:24 INFO FlatMappedDStream: Remember duration = 1000 ms
> 15/03/06 15:30:24 INFO FlatMappedDStream: Initialized and validated
> org.apache.spark.streaming.dstream.FlatMappedDStream@6bd47453
> 15/03/06 15:30:24 INFO MappedDStream: Slide time = 1000 ms
> 15/03/06 15:30:24 INFO MappedDStream: Storage level = StorageLevel(false,
> false, false, false, 1)
> 15/03/06 15:30:24 INFO MappedDStream: Checkpoint interval = null
> 15/03/06 15:30:24 INFO MappedDStream: Remember duration = 1000 ms
> 15/03/06 15:30:24 INFO MappedDStream: Initialized and validated
> org.apache.spark.streaming.dstream.MappedDStream@941451f
> 15/03/06 15:30:24 INFO ShuffledDStream: Slide time = 1000 ms
> 15/03/06 15:30:24 INFO ShuffledDStream: Storage level =
> StorageLevel(false, false, false, false, 1)
> 15/03/06 15:30:24 INFO ShuffledDStream: Checkpoint interval = null
> 15/03/06 15:30:24 INFO ShuffledDStream: Remember duration = 1000 ms
> 15/03/06 15:30:24 INFO ShuffledDStream: Initialized and validated
> org.apache.spark.streaming.dstream.ShuffledDStream@42eba6ee
> 15/03/06 15:30:24 INFO ForEachDStream: Slide time = 1000 ms
> 15/03/06 15:30:24 INFO ForEachDStream: Storage level = StorageLevel(false,
> false, false, false, 1)
> 15/03/06 15:30:24 INFO ForEachDStream: Checkpoint interval = null
> 15/03/06 15:30:24 INFO ForEachDStream: Remember duration = 1000 ms
> 15/03/06 15:30:24 INFO ForEachDStream: Initialized and validated
> org.apache.spark.streaming.dstream.ForEachDStream@48d166b5
> 15/03/06 15:

Re: spark-stream programme failed on yarn-client

2015-03-06 Thread fenghaixiong
Thanks, your advice was useful. I had submitted my job from my Spark client, which
is configured with only a minimal configuration file, so it failed.
When I run the job on the service machine, everything is okay.

 
On Fri, Mar 06, 2015 at 02:10:04PM +0530, Akhil Das wrote:
> Looks like an issue with your yarn setup, could you try doing a simple
> example with spark-shell?
> 
> Start the spark shell as:
> 
> $ MASTER=yarn-client bin/spark-shell
> spark-shell> sc.parallelize(1 to 1000).collect
> 
> If that doesn't work, then make sure your YARN services are up and running,
> and in your spark-env.sh you may set the corresponding configurations from
> the following:
> 
> 
> # Options read in YARN client mode
> # - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
> # - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
> # - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
> # - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
> # - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512
> Mb)
> 
> 
> Thanks
> Best Regards
> 
> On Fri, Mar 6, 2015 at 1:09 PM, fenghaixiong <980548...@qq.com> wrote:
> 
> > Hi all,
> >
> >
> > I'm try to write a spark stream programme so i read the spark
> > online document ,according the document i write the programe like this :
> >
> > import org.apache.spark.SparkConf
> > import org.apache.spark.streaming._
> > import org.apache.spark.streaming.StreamingContext._
> >
> > object SparkStreamTest {
> >   def main(args: Array[String]) {
> > val conf = new SparkConf()
> > val ssc = new StreamingContext(conf, Seconds(1))
> > val lines = ssc.socketTextStream(args(0), args(1).toInt)
> > val words = lines.flatMap(_.split(" "))
> > val pairs = words.map(word => (word, 1))
> > val wordCounts = pairs.reduceByKey(_ + _)
> > wordCounts.print()
> > ssc.start() // Start the computation
> > ssc.awaitTermination()  // Wait for the computation to terminate
> >   }
> >
> > }
> >
> >
> >
> > for test i first start listen a port by this:
> >  nc -lk 
> >
> > and then i submit job by
> > spark-submit  --master local[2] --class com.nd.hxf.SparkStreamTest
> > spark-test-tream-1.0-SNAPSHOT-job.jar  localhost 
> >
> > everything is okay
> >
> >
> > but when i run it on yarn by this :
> > spark-submit  --master yarn-client --class com.nd.hxf.SparkStreamTest
> > spark-test-tream-1.0-SNAPSHOT-job.jar  localhost 
> >
> > it wait for a longtime and repeat output somemessage a apart of the output
> > is like this:
> >
> >
> >
> >
> >
> >
> >
> > 15/03/06 15:30:24 INFO YarnClientSchedulerBackend: SchedulerBackend is
> > ready for scheduling beginning after waiting
> > maxRegisteredResourcesWaitingTime: 3(ms)
> > 15/03/06 15:30:24 INFO ReceiverTracker: ReceiverTracker started
> > 15/03/06 15:30:24 INFO ForEachDStream: metadataCleanupDelay = -1
> > 15/03/06 15:30:24 INFO ShuffledDStream: metadataCleanupDelay = -1
> > 15/03/06 15:30:24 INFO MappedDStream: metadataCleanupDelay = -1
> > 15/03/06 15:30:24 INFO FlatMappedDStream: metadataCleanupDelay = -1
> > 15/03/06 15:30:24 INFO SocketInputDStream: metadataCleanupDelay = -1
> > 15/03/06 15:30:24 INFO SocketInputDStream: Slide time = 1000 ms
> > 15/03/06 15:30:24 INFO SocketInputDStream: Storage level =
> > StorageLevel(false, false, false, false, 1)
> > 15/03/06 15:30:24 INFO SocketInputDStream: Checkpoint interval = null
> > 15/03/06 15:30:24 INFO SocketInputDStream: Remember duration = 1000 ms
> > 15/03/06 15:30:24 INFO SocketInputDStream: Initialized and validated
> > org.apache.spark.streaming.dstream.SocketInputDStream@b01c5f8
> > 15/03/06 15:30:24 INFO FlatMappedDStream: Slide time = 1000 ms
> > 15/03/06 15:30:24 INFO FlatMappedDStream: Storage level =
> > StorageLevel(false, false, false, false, 1)
> > 15/03/06 15:30:24 INFO FlatMappedDStream: Checkpoint interval = null
> > 15/03/06 15:30:24 INFO FlatMappedDStream: Remember duration = 1000 ms
> > 15/03/06 15:30:24 INFO FlatMappedDStream: Initialized and validated
> > org.apache.spark.streaming.dstream.FlatMappedDStream@6bd47453
> > 15/03/06 15:30:24 INFO MappedDStream: Slide time = 1000 ms
> > 15/03/06 15:30:24 INFO MappedDStream: Storage level = StorageLevel(false,
> > false, false, false, 1)
> > 15/03/06 15:30:24 INFO MappedDStream: Checkpoint interval = null
> > 15/03/06 15:30:24 INFO MappedDStream: Remember duration = 1000 ms
> > 15/03/06 15:30:24 INFO MappedDStream: Initialized and validated
> > org.apache.spark.streaming.dstream.MappedDStream@941451f
> > 15/03/06 15:30:24 INFO ShuffledDStream: Slide time = 1000 ms
> > 15/03/06 15:30:24 INFO ShuffledDStream: Storage level =
> > StorageLevel(false, false, false, false, 1)
> > 15/03/06 15:30:24 INFO ShuffledDStream: Checkpoint interval = null
> > 15/03/06 15:30:24 INFO ShuffledDStream: Remember duration = 1000 ms
> > 15/03/06 15:30:24 INFO ShuffledDStream: Initialized and validated
> > org.apache.spark.streaming.dstre

Re: Compile Spark with Maven & Zinc Scala Plugin

2015-03-06 Thread fenghaixiong
You can read this document:
http://spark.apache.org/docs/latest/building-spark.html
It might solve your question.
If you compile Spark with Maven, you may need to set the Maven options like this
before you start compiling:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

thanks
On Fri, Mar 06, 2015 at 07:10:34PM +1100, Night Wolf wrote:
> Tried with that. No luck. Same error on abt-interface jar. I can see maven
> downloaded that jar into my .m2 cache
> 
> On Friday, March 6, 2015, 鹰 <980548...@qq.com> wrote:
> 
> > try it with mvn  -DskipTests -Pscala-2.11 clean install package

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark code development practice

2015-03-06 Thread Sean Owen
Hm, why do you expect a factory method over a constructor? No, you
instantiate a SparkContext yourself (if not working in the shell).

When you write your own program, you parse your own command line args.
--master yarn-client doesn't do anything unless you make it do so.
That is an arg to *Spark* programs.

In many cases these just end up setting a value on SparkConf. Not
always, though; you would need to trace through programs like
SparkSubmit to see how to emulate the effect of some args.

Unless you really need to do this, you should not. Write an app that
you run with spark-submit.
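
For illustration, a minimal sketch of such an app (the names are illustrative,
not from this thread): it builds its own SparkContext from a SparkConf and
leaves the master unset, so spark-submit's --master flag (local[2],
yarn-client, ...) decides where it runs.

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // No setMaster here: spark-submit --master ... supplies it at launch time.
    val conf = new SparkConf().setAppName("MyApp")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 1000).count())
    sc.stop()
  }
}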

On Fri, Mar 6, 2015 at 2:56 AM, Xi Shen  wrote:
> Thanks guys, this is very useful :)
>
> @Stephen, I know spark-shell will create a SC for me. But I don't understand
> why we still need to do "new SparkContext(...)" in our code. Shouldn't we
> get it from somewhere, e.g. "SparkContext.get"?
>
> Another question: if I want my Spark code to run on YARN later, how should I
> create the SparkContext? Or can I just specify "--master yarn" on the command
> line?
>
>
> Thanks,
> David
>
>
> On Fri, Mar 6, 2015 at 12:38 PM Koen Vantomme 
> wrote:
>>
>> Use the spark-shell command and the shell will open.
>> Type :paste, then paste your code, and finish with Ctrl-D.
>>
>> open spark-shell:
>> sparks/bin
>> ./spark-shell
>>
>> Verstuurd vanaf mijn iPhone
>>
>> Op 6-mrt.-2015 om 02:28 heeft "fightf...@163.com"  het
>> volgende geschreven:
>>
>> Hi,
>>
>> You can first set up a Scala IDE to develop and debug your Spark
>> program, say IntelliJ IDEA or Eclipse.
>>
>> Thanks,
>> Sun.
>>
>> 
>> fightf...@163.com
>>
>>
>> From: Xi Shen
>> Date: 2015-03-06 09:19
>> To: user@spark.apache.org
>> Subject: Spark code development practice
>> Hi,
>>
>> I am new to Spark. I see every Spark program has a main() function. I
>> wonder if I can run the Spark program directly, without using spark-submit.
>> I think it would be easier for early development and debugging.
>>
>>
>> Thanks,
>> David
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Compile Spark with Maven & Zinc Scala Plugin

2015-03-06 Thread Sean Owen
Are you letting Spark download and run zinc for you? Maybe that copy
is incomplete or corrupted. You can try removing the downloaded zinc
from build/ and trying again.

Or run your own zinc.

On Fri, Mar 6, 2015 at 7:51 AM, Night Wolf  wrote:
> Hey,
>
> Trying to build latest spark 1.3 with Maven using
>
> -DskipTests clean install package
>
> But I'm getting errors with zinc, in the logs I see;
>
> [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
> spark-network-common_2.11 ---
>
>
> ...
>
> [error] Required file not found: sbt-interface.jar
> [error] See zinc -help for information about locating necessary files
>
>
> Any ideas?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Building Spark 1.3 for Scala 2.11 using Maven

2015-03-06 Thread Sean Owen
-Pscala-2.11 and -Dscala-2.11 will happen to do the same thing for this profile.

Why are you running "install package" and not just "install"? Probably
doesn't matter.

This sounds like you are trying to only build core without building
everything else, which you can't do in general unless you already
built and installed these snapshot artifacts locally.

On Fri, Mar 6, 2015 at 12:46 AM, Night Wolf  wrote:
> Hey guys,
>
> Trying to build Spark 1.3 for Scala 2.11.
>
> I'm running with the folllowng Maven command;
>
> -DskipTests -Dscala-2.11 clean install package
>
>
> Exception:
>
> [ERROR] Failed to execute goal on project spark-core_2.10: Could not resolve
> dependencies for project
> org.apache.spark:spark-core_2.10:jar:1.3.0-SNAPSHOT: The following artifacts
> could not be resolved:
> org.apache.spark:spark-network-common_2.11:jar:1.3.0-SNAPSHOT,
> org.apache.spark:spark-network-shuffle_2.11:jar:1.3.0-SNAPSHOT: Failure to
> find org.apache.spark:spark-network-common_2.11:jar:1.3.0-SNAPSHOT in
> http://repository.apache.org/snapshots was cached in the local repository,
> resolution will not be reattempted until the update interval of
> apache.snapshots has elapsed or updates are forced -> [Help 1]
>
>
> I see these warnings in the log before this error:
>
>
> [INFO]
> [INFO]
> 
> [INFO] Building Spark Project Core 1.3.0-SNAPSHOT
> [INFO]
> 
> [WARNING] The POM for
> org.apache.spark:spark-network-common_2.11:jar:1.3.0-SNAPSHOT is missing, no
> dependency information available
> [WARNING] The POM for
> org.apache.spark:spark-network-shuffle_2.11:jar:1.3.0-SNAPSHOT is missing,
> no dependency information available
>
>
> Any ideas?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



No overwrite flag for saveAsXXFile

2015-03-06 Thread Jeff Zhang
Hi folks,

I found that RDD.saveAsXXFile has no overwrite flag, which I think would be very
helpful. Is there any reason for this?



-- 
Best Regards

Jeff Zhang


Optimizing SQL Query

2015-03-06 Thread anu
I have a query like the one below. Could you give me pointers on how to start
optimizing it with Spark SQL?


sqlContext.sql("
SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE
FROM (
    SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS SDP_USAGE
    FROM (
        SELECT *
        FROM date_d dd JOIN interval_f intf
          ON intf.DATE_WID = dd.WID
        WHERE intf.DATE_WID >= 20141116 AND intf.DATE_WID <= 20141125
          AND CAST(INTERVAL_END_TIME AS STRING) >= '2014-11-16 00:00:00.000'
          AND CAST(INTERVAL_END_TIME AS STRING) <= '2014-11-26 00:00:00.000'
          AND MEAS_WID = 3
    ) test JOIN sdp_d sdp
      ON test.SDP_WID = sdp.WID
    WHERE sdp.UDC_ID = 'SP-1931201848'
    GROUP BY sdp.WID, DAY_OF_WEEK, HOUR, sdp.UDC_ID
) dw
GROUP BY dw.DAY_OF_WEEK, dw.HOUR")



Currently the query takes 15 minutes to execute; the interval_f table holds
approx. 170 GB of raw data, date_d about 170 MB, and sdp_d about 490 MB.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Optimizing-SQL-Query-tp21948.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-06 Thread James
Hello,

I want to execute a hql script through `spark-sql` command, my script
contains:

```
ALTER TABLE xxx
DROP PARTITION (date_key = ${hiveconf:CUR_DATE});
```

when I execute

```
spark-sql -f script.hql -hiveconf CUR_DATE=20150119
```

It throws an error like
```
cannot recognize input near '$' '{' 'hiveconf' in constant
```

I have tried this on Hive and it works. So how can I pass a parameter like
a date to an HQL script?

Alcaid


Re: Optimizing SQL Query

2015-03-06 Thread daniel queiroz
Dude,

Please attach the execution plan of the query and details about the
indexes.
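
For reference, a rough sketch (assuming the Spark SQL 1.2 API and an existing
sqlContext) of one way to get that execution plan: the SchemaRDD returned by
sql() exposes its queryExecution, which prints the logical, optimized and
physical plans. The query string is abbreviated here; paste the full statement
quoted below.

val query = "SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE FROM ..." // abbreviated
val plan = sqlContext.sql(query)
println(plan.queryExecution)   // parsed, analyzed, optimized and physical plans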



2015-03-06 9:07 GMT-03:00 anu :

> I have a query that's like:
>
> Could you help in providing me pointers as to how to start to optimize it
> w.r.t. spark sql:
>
>
> sqlContext.sql("
>
> SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE
>
> FROM (
>SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS
> SDP_USAGE
>
>FROM (
>  SELECT *
>
>  FROM date_d dd JOIN interval_f intf
>
>  ON intf.DATE_WID = dd.WID
>
>  WHERE intf.DATE_WID >= 20141116 AND
> intf.DATE_WID <= 20141125 AND CAST(INTERVAL_END_TIME AS STRING) >=
> '2014-11-16 00:00:00.000' AND  CAST(INTERVAL_END_TIME
>AS STRING) <= '2014-11-26
> 00:00:00.000' AND MEAS_WID = 3
>
>   ) test JOIN sdp_d sdp
>
>ON test.SDP_WID = sdp.WID
>
>WHERE sdp.UDC_ID = 'SP-1931201848'
>
>GROUP BY sdp.WID, DAY_OF_WEEK, HOUR, sdp.UDC_ID
>
>) dw
>
> GROUP BY dw.DAY_OF_WEEK, dw.HOUR")
>
>
>
> Currently the query takes 15 minutes execution time where interval_f table
> holds approx 170GB of raw data, date_d --> 170 MB and sdp_d --> 490MB
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Optimizing-SQL-Query-tp21948.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: LBGFS optimizer performace

2015-03-06 Thread Gustavo Enrique Salazar Torres
Hi there:

Yeah, I came to the same conclusion after tuning the Spark SQL shuffle
parameter. I also cut out some classes I was using to parse my dataset and
finally created a schema with only the fields needed for my model (before
that I was creating it with 63 fields when I only needed 15).
So I came up with this set of parameters:
So I came with this set of parameters:

--num-executors 200
--executor-memory 800M
--conf spark.executor.extraJavaOptions="-XX:+UseCompressedOops
-XX:+AggressiveOpts"
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.storage.memoryFraction=0.3
--conf spark.rdd.compress=true
--conf spark.sql.shuffle.partitions=4000
--driver-memory 4G

Now I processed 270 GB in 35 minutes and no OOM errors.
I have one question though: does Spark SQL handle skewed tables? I was
wondering about that since my data is skewed, and maybe there is more
room for performance improvement.

Thanks again.

Gustavo
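
For illustration, a rough sketch of the compressed, serialized caching that
DB Tsai recommends in the quoted message below. It assumes
spark.rdd.compress=true (already among the submit flags above), so the
MEMORY_ONLY_SER blocks are also compressed; toLabeledPoint is a hypothetical
function that turns a Row into a training example.

import org.apache.spark.storage.StorageLevel

val training = sqlContext.sql("SELECT ...").map(toLabeledPoint)  // abbreviated query
training.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized in memory, compressed via spark.rdd.compress
training.count()  // materialize the cache once before handing the RDD to LBFGS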


On Thu, Mar 5, 2015 at 6:45 PM, DB Tsai  wrote:

> PS, I will recommend you compress the data when you cache the RDD.
> There will be some overhead in compression/decompression, and
> serialization/deserialization, but it will help a lot for iterative
> algorithms with ability to caching more data.
>
> Sincerely,
>
> DB Tsai
> ---
> Blog: https://www.dbtsai.com
>
>
> On Tue, Mar 3, 2015 at 2:27 PM, Gustavo Enrique Salazar Torres
>  wrote:
> > Yeah, I can call count before that and it works. Also I was over caching
> > tables but I removed those. Now there is no caching but it gets really
> slow
> > since it calculates my table RDD many times.
> > Also hacked the LBFGS code to pass the number of examples which I
> calculated
> > outside in a Spark SQL query but just moved the location of the problem.
> >
> > The query I'm running looks like this:
> >
> > s"SELECT $mappedFields, tableB.id as b_id FROM tableA LEFT JOIN tableB
> ON
> > tableA.id=tableB.table_a_id WHERE tableA.status='' OR tableB.status='' "
> >
> > mappedFields contains a list of fields which I'm interested in. The
> result
> > of that query goes through (including sampling) some transformations
> before
> > being input to LBFGS.
> >
> > My dataset has 180GB just for feature selection, I'm planning to use
> 450GB
> > to train the final model and I'm using 16 c3.2xlarge EC2 instances, that
> > means I have 240GB of RAM available.
> >
> > Any suggestion? I'm starting to check the algorithm because I don't
> > understand why it needs to count the dataset.
> >
> > Thanks
> >
> > Gustavo
> >
> > On Tue, Mar 3, 2015 at 6:08 PM, Joseph Bradley 
> > wrote:
> >>
> >> Is that error actually occurring in LBFGS?  It looks like it might be
> >> happening before the data even gets to LBFGS.  (Perhaps the outer join
> >> you're trying to do is making the dataset size explode a bit.)  Are you
> able
> >> to call count() (or any RDD action) on the data before you pass it to
> LBFGS?
> >>
> >> On Tue, Mar 3, 2015 at 8:55 AM, Gustavo Enrique Salazar Torres
> >>  wrote:
> >>>
> >>> Just did with the same error.
> >>> I think the problem is the "data.count()" call in LBFGS because for
> huge
> >>> datasets that's naive to do.
> >>> I was thinking to write my version of LBFGS but instead of doing
> >>> data.count() I will pass that parameter which I will calculate from a
> Spark
> >>> SQL query.
> >>>
> >>> I will let you know.
> >>>
> >>> Thanks
> >>>
> >>>
> >>> On Tue, Mar 3, 2015 at 3:25 AM, Akhil Das 
> >>> wrote:
> 
>  Can you try increasing your driver memory, reducing the executors and
>  increasing the executor memory?
> 
>  Thanks
>  Best Regards
> 
>  On Tue, Mar 3, 2015 at 10:09 AM, Gustavo Enrique Salazar Torres
>   wrote:
> >
> > Hi there:
> >
> > I'm using LBFGS optimizer to train a logistic regression model. The
> > code I implemented follows the pattern showed in
> > https://spark.apache.org/docs/1.2.0/mllib-linear-methods.html but
> training
> > data is obtained from a Spark SQL RDD.
> > The problem I'm having is that LBFGS tries to count the elements in
> my
> > RDD and that results in a OOM exception since my dataset is huge.
> > I'm running on a AWS EMR cluster with 16 c3.2xlarge instances on
> Hadoop
> > YARN. My dataset is about 150 GB but I sample (I take only 1% of the
> data)
> > it in order to scale logistic regression.
> > The exception I'm getting is this:
> >
> > 15/03/03 04:21:44 WARN scheduler.TaskSetManager: Lost task 108.0 in
> > stage 2.0 (TID 7600, ip-10-155-20-71.ec2.internal):
> > java.lang.OutOfMemoryError: Java heap space
> > at java.util.Arrays.copyOfRange(Arrays.java:2694)
> > at java.lang.String.(String.java:203)
> > at
> > com.esotericsoftware.kryo.io.Input.readString(Input.java:448)
> > at
> >
> com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157)
> >  

Using 1.3.0 client jars with 1.2.1 assembly in yarn-cluster mode

2015-03-06 Thread Zsolt Tóth
Hi,

I submit Spark jobs in yarn-cluster mode remotely from Java code by calling
Client.submitApplication(). For some reason I want to use 1.3.0 jars on the
client side (e.g. spark-yarn_2.10-1.3.0.jar), but I have
The problem is that the ApplicationMaster can't find the user application
jar (specified with --jar option). I think this is because of changes in
the classpath population in the Client class.
Is it possible to make this setup work without changing the codebase or the
jars?

Cheers,
Zsolt


Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Ted Yu
Adding support for an overwrite flag would make saveAsXXFile more user-friendly.

Cheers



> On Mar 6, 2015, at 2:14 AM, Jeff Zhang  wrote:
> 
> Hi folks,
> 
> I found that RDD:saveXXFile has no overwrite flag which I think is very 
> helpful. Is there any reason for this ?
> 
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Edmon Begoli
 Does Spark-SQL require installation of Hive for it to run correctly or not?

I could not tell from this statement:
https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

Thank you,
Edmon


Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Sean Owen
This was discussed in the past and viewed as dangerous to enable. The
biggest problem, by far, comes when you have a job that outputs M
partitions, 'overwriting' a directory of data containing N > M old
partitions. You suddenly have a mix of new and old data.

It doesn't match Hadoop's semantics either, which won't let you do
this. You can of course simply remove the output directory.
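
For illustration, a small sketch of that manual removal before saving, in lieu
of an overwrite flag. It assumes an existing SparkContext sc and an RDD rdd;
the path is made up.

import org.apache.hadoop.fs.{FileSystem, Path}

val out = new Path("hdfs:///tmp/word-counts")
val fs = out.getFileSystem(sc.hadoopConfiguration)
if (fs.exists(out)) fs.delete(out, true)  // recursive delete of the stale output
rdd.saveAsTextFile(out.toString)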

On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu  wrote:
> Adding support for overwrite flag would make saveAsXXFile more user friendly.
>
> Cheers
>
>
>
>> On Mar 6, 2015, at 2:14 AM, Jeff Zhang  wrote:
>>
>> Hi folks,
>>
>> I found that RDD:saveXXFile has no overwrite flag which I think is very 
>> helpful. Is there any reason for this ?
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Integer column in schema RDD from parquet being considered as string

2015-03-06 Thread gtinside
Hi tsingfu ,

Thanks for your reply. I tried with other columns, but the problem is the same
with other Integer columns.

Regards,
Gaurav



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Integer-column-in-schema-RDD-from-parquet-being-considered-as-string-tp21917p21950.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Ted Yu
Found this thread:
http://search-hadoop.com/m/JW1q5HMrge2

Cheers

On Fri, Mar 6, 2015 at 6:42 AM, Sean Owen  wrote:

> This was discussed in the past and viewed as dangerous to enable. The
> biggest problem, by far, comes when you have a job that output M
> partitions, 'overwriting' a directory of data containing N > M old
> partitions. You suddenly have a mix of new and old data.
>
> It doesn't match Hadoop's semantics either, which won't let you do
> this. You can of course simply remove the output directory.
>
> On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu  wrote:
> > Adding support for overwrite flag would make saveAsXXFile more user
> friendly.
> >
> > Cheers
> >
> >
> >
> >> On Mar 6, 2015, at 2:14 AM, Jeff Zhang  wrote:
> >>
> >> Hi folks,
> >>
> >> I found that RDD:saveXXFile has no overwrite flag which I think is very
> helpful. Is there any reason for this ?
> >>
> >>
> >>
> >> --
> >> Best Regards
> >>
> >> Jeff Zhang
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>


Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Yin Huai
Hi Edmon,

No, you do not need to install Hive to use Spark SQL.

Thanks,

Yin

On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli  wrote:

>  Does Spark-SQL require installation of Hive for it to run correctly or
> not?
>
> I could not tell from this statement:
>
> https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive
>
> Thank you,
> Edmon
>


Data Frame types

2015-03-06 Thread Cesar Flores
SchemaRDD supports the storage of user-defined classes. However, in order to
do that, the user class needs to extend the UserDefinedType interface (see for
example VectorUDT in org.apache.spark.mllib.linalg).

My question is: will the new DataFrame structure (to be released in Spark 1.3)
be able to handle user-defined classes too? Will user classes need to extend
the same interface and follow the same approach?


-- 
Cesar Flores


Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Nan Zhu
Actually, besides setting spark.hadoop.validateOutputSpecs to false to disable
output validation for the whole program,

the Spark implementation uses a DynamicVariable (in object PairRDDFunctions)
internally to disable it on a case-by-case basis:

val disableOutputSpecValidation: DynamicVariable[Boolean] = new
DynamicVariable[Boolean](false)

I’m not sure there is enough benefit to make it worth exposing
this variable to the user…

Best,  

--  
Nan Zhu
http://codingcat.me


On Friday, March 6, 2015 at 10:22 AM, Ted Yu wrote:

> Found this thread:
> http://search-hadoop.com/m/JW1q5HMrge2
>  
> Cheers
>  
> On Fri, Mar 6, 2015 at 6:42 AM, Sean Owen <so...@cloudera.com> wrote:
> > This was discussed in the past and viewed as dangerous to enable. The
> > biggest problem, by far, comes when you have a job that output M
> > partitions, 'overwriting' a directory of data containing N > M old
> > partitions. You suddenly have a mix of new and old data.
> >  
> > It doesn't match Hadoop's semantics either, which won't let you do
> > this. You can of course simply remove the output directory.
> >  
> > On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > Adding support for overwrite flag would make saveAsXXFile more user 
> > > friendly.
> > >
> > > Cheers
> > >
> > >
> > >
> > >> On Mar 6, 2015, at 2:14 AM, Jeff Zhang <zjf...@gmail.com> wrote:
> > >>
> > >> Hi folks,
> > >>
> > >> I found that RDD:saveXXFile has no overwrite flag which I think is very 
> > >> helpful. Is there any reason for this ?
> > >>
> > >>
> > >>
> > >> --
> > >> Best Regards
> > >>
> > >> Jeff Zhang
> > >
> > > -
> > > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > > For additional commands, e-mail: user-h...@spark.apache.org
> > >
>  



Re: Data Frame types

2015-03-06 Thread Jaonary Rabarisoa
Hi Cesar,

Yes, you can define a UDT with the new DataFrame, the same way you did
with SchemaRDD.

Jaonary

On Fri, Mar 6, 2015 at 4:22 PM, Cesar Flores  wrote:

>
> The SchemaRDD supports the storage of user defined classes. However, in
> order to do that, the user class needs to extends the UserDefinedType 
> interface
> (see for example VectorUDT in org.apache.spark.mllib.linalg).
>
> My question is: Do the new Data Frame Structure (to be released in spark
> 1.3) will be able to handle user defined classes too? Do user classes will
> need to extend they will need to define the same approach?
>
>
> --
> Cesar Flores
>


spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Koert Kuipers
currently spark provides many excellent algorithms for operations per key
as long as the data send to the reducers per key fits in memory. operations
like combineByKey, reduceByKey and foldByKey rely on pushing the operation
map-side so that the data reduce-side is small. and groupByKey simply
requires that the values per key fit in memory.

but there are algorithms for which we would like to process all the values
per key reduce-side, even when they do not fit in memory. examples are
algorithms that need to process the values ordered, or algorithms that need
to emit all values again. basically this is what the original hadoop reduce
operation did so well: it allowed sorting of values (using secondary sort),
and it processed all values per key in a streaming fashion.

the library spark-sorted aims to bring these kind of operations back to
spark, by providing a way to process values with a user provided
Ordering[V] and a user provided streaming operation Iterator[V] =>
Iterator[W]. it does not make the assumption that the values need to fit in
memory per key.

the basic idea is to rely on spark's sort-based shuffle to re-arrange the
data so that all values for a given key are placed consecutively within a
single partition, and then process them using a map-like operation.
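
For illustration, a rough sketch of that composite-key idea using Spark's own
repartitionAndSortWithinPartitions. This is not the spark-sorted API, just the
underlying pattern it builds on; String keys and Int values are made up.

import org.apache.spark.Partitioner
import org.apache.spark.SparkContext._   // pair/ordered RDD implicits (needed on Spark 1.2)
import org.apache.spark.rdd.RDD

// Partition only on the real key, even though the shuffle key is (key, value).
class KeyOnlyPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(compositeKey: Any): Int = compositeKey match {
    case (k: String, _) => ((k.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

// After the shuffle, all pairs of a given key sit consecutively in one partition,
// sorted by value; a mapPartitions over the result can stream through them
// without ever collecting the values of a key in memory.
def groupSorted(rdd: RDD[(String, Int)], partitions: Int): RDD[(String, Int)] =
  rdd.map { case (k, v) => ((k, v), ()) }
    .repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(partitions))
    .map { case ((k, v), _) => (k, v) }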

you can find the project here:
https://github.com/tresata/spark-sorted

the project is in a very early stage. any feedback is very much appreciated.


Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Jaonary Rabarisoa
Hi Shivaram,

Thank you for the link. I'm trying to figure out how I can port this to
MLlib. Maybe you can help me understand how the pieces fit together.
Currently, MLlib has different types of distributed matrices:

BlockMatrix, CoordinateMatrix, IndexedRowMatrix and RowMatrix. Which one
corresponds to RowPartitionedMatrix in ml-matrix?



On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> There are couple of solvers that I've written that is part of the AMPLab
> ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are
> interested in porting them I'd be happy to review it
>
> Thanks
> Shivaram
>
>
> [1]
> https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
> [2]
> https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala
>
> On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa 
> wrote:
>
>> Dear all,
>>
>> Is there a least square solver based on DistributedMatrix that we can use
>> out of the box in the current (or the master) version of spark ?
>> It seems that the only least square solver available in spark is private
>> to recommender package.
>>
>>
>> Cheers,
>>
>> Jao
>>
>
>


Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Jaonary Rabarisoa
Do you have a reference paper for the algorithm implemented in TSQR.scala?

On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> There are couple of solvers that I've written that is part of the AMPLab
> ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are
> interested in porting them I'd be happy to review it
>
> Thanks
> Shivaram
>
>
> [1]
> https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
> [2]
> https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala
>
> On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa 
> wrote:
>
>> Dear all,
>>
>> Is there a least square solver based on DistributedMatrix that we can use
>> out of the box in the current (or the master) version of spark ?
>> It seems that the only least square solver available in spark is private
>> to recommender package.
>>
>>
>> Cheers,
>>
>> Jao
>>
>
>


Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Ted Yu
Since we already have the "spark.hadoop.validateOutputSpecs" config, I think
there is not much need to expose disableOutputSpecValidation.

Cheers

On Fri, Mar 6, 2015 at 7:34 AM, Nan Zhu  wrote:

>  Actually, except setting spark.hadoop.validateOutputSpecs to false to
> disable output validation for the whole program
>
> Spark implementation uses a Dynamic Variable (object PairRDDFunctions)
> internally to disable it in a case-by-case manner
>
> val disableOutputSpecValidation: DynamicVariable[Boolean] = new 
> DynamicVariable[Boolean](false)
>
>
> I’m not sure if there is enough amount of benefits to make it worth exposing 
> this variable to the user…
>
>
> Best,
>
>
> --
> Nan Zhu
> http://codingcat.me
>
> On Friday, March 6, 2015 at 10:22 AM, Ted Yu wrote:
>
> Found this thread:
> http://search-hadoop.com/m/JW1q5HMrge2
>
> Cheers
>
> On Fri, Mar 6, 2015 at 6:42 AM, Sean Owen  wrote:
>
> This was discussed in the past and viewed as dangerous to enable. The
> biggest problem, by far, comes when you have a job that output M
> partitions, 'overwriting' a directory of data containing N > M old
> partitions. You suddenly have a mix of new and old data.
>
> It doesn't match Hadoop's semantics either, which won't let you do
> this. You can of course simply remove the output directory.
>
> On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu  wrote:
> > Adding support for overwrite flag would make saveAsXXFile more user
> friendly.
> >
> > Cheers
> >
> >
> >
> >> On Mar 6, 2015, at 2:14 AM, Jeff Zhang  wrote:
> >>
> >> Hi folks,
> >>
> >> I found that RDD:saveXXFile has no overwrite flag which I think is very
> helpful. Is there any reason for this ?
> >>
> >>
> >>
> >> --
> >> Best Regards
> >>
> >> Jeff Zhang
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
>
>
>


Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Shivaram Venkataraman
Sections 3, 4 and 5 in http://www.netlib.org/lapack/lawnspdf/lawn204.pdf are a
good reference.
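
For reference, the problem in the subject line has the standard closed-form
(normal equations) solution

  min_x ||A x - b||^2 + lambda * n * ||x||^2   ==>   x* = (A^T A + lambda * n * I)^(-1) A^T b,

which is what a NormalEquations-style solver computes. A TSQR-style solver
instead builds the QR factorization of A through a tree of local QR
factorizations (the approach described in the reference above) and solves the
resulting triangular system, which avoids explicitly forming A^T A.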

Shivaram
On Mar 6, 2015 9:17 AM, "Jaonary Rabarisoa"  wrote:

> Do you have a reference paper to the implemented algorithm in TSQR.scala ?
>
> On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> There are couple of solvers that I've written that is part of the AMPLab
>> ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are
>> interested in porting them I'd be happy to review it
>>
>> Thanks
>> Shivaram
>>
>>
>> [1]
>> https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
>> [2]
>> https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala
>>
>> On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa 
>> wrote:
>>
>>> Dear all,
>>>
>>> Is there a least square solver based on DistributedMatrix that we can
>>> use out of the box in the current (or the master) version of spark ?
>>> It seems that the only least square solver available in spark is private
>>> to recommender package.
>>>
>>>
>>> Cheers,
>>>
>>> Jao
>>>
>>
>>
>


Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
Its not required, but even if you don't have hive installed you probably
still want to use the HiveContext.  From earlier in that doc:

In addition to the basic SQLContext, you can also create a HiveContext,
> which provides a superset of the functionality provided by the basic
> SQLContext. Additional features include the ability to write queries using
> the more complete HiveQL parser, access to HiveUDFs, and the ability to
> read data from Hive tables. To use a HiveContext, *you do not need to
> have an existing Hive setup*, and all of the data sources available to a
> SQLContext are still available. HiveContext is only packaged separately to
> avoid including all of Hive’s dependencies in the default Spark build. If
> these dependencies are not a problem for your application then using
> HiveContext is recommended for the 1.2 release of Spark. Future releases
> will focus on bringing SQLContext up to feature parity with a HiveContext.
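
For illustration, a minimal sketch of using a HiveContext with no Hive
installation at all, against the Spark 1.2 API; sc is the SparkContext that
spark-shell provides, and the Record case class and table name are made up.

import org.apache.spark.sql.hive.HiveContext

case class Record(id: Int, name: String)

val hiveContext = new HiveContext(sc)
import hiveContext.createSchemaRDD  // implicitly turns RDD[Record] into a SchemaRDD (Spark 1.2)

val records = sc.parallelize(Seq(Record(1, "alpha"), Record(2, "beta")))
records.registerTempTable("records")
hiveContext.sql("SELECT name FROM records WHERE id = 1").collect().foreach(println)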


On Fri, Mar 6, 2015 at 7:22 AM, Yin Huai  wrote:

> Hi Edmon,
>
> No, you do not need to install Hive to use Spark SQL.
>
> Thanks,
>
> Yin
>
> On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli  wrote:
>
>>  Does Spark-SQL require installation of Hive for it to run correctly or
>> not?
>>
>> I could not tell from this statement:
>>
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive
>>
>> Thank you,
>> Edmon
>>
>
>


Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Todd Nist
First, thanks to everyone for their assistance and recommendations.

@Marcelo

I applied the patch that you recommended and am now able to get into the
shell. Thank you, it worked great once I realized that the POM was pointing to
the 1.3.0-SNAPSHOT parent and needed to be bumped down to 1.2.1.

@Zhan

I need to apply this patch next. I tried to start the Spark Thrift Server; it
starts, then fails like this (I have the entries in my spark-defaults.conf,
but the patch is not yet applied):

./sbin/start-thriftserver.sh --master yarn --executor-memory 1024m
--hiveconf hive.server2.thrift.port=10001

5/03/06 12:34:17 INFO ui.SparkUI: Started SparkUI at http://hadoopdev01.opsdatastore.com:4040
15/03/06 12:34:18 INFO impl.TimelineClientImpl: Timeline service address: http://hadoopdev02.opsdatastore.com:8188/ws/v1/timeline/
15/03/06 12:34:18 INFO client.RMProxy: Connecting to ResourceManager at hadoopdev02.opsdatastore.com/192.168.15.154:8050
15/03/06 12:34:18 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
15/03/06 12:34:18 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/03/06 12:34:18 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/03/06 12:34:18 INFO yarn.Client: Setting up container launch context for our AM
15/03/06 12:34:18 INFO yarn.Client: Preparing resources for our AM container
15/03/06 12:34:19 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
15/03/06 12:34:19 INFO yarn.Client: Uploading resource file:/root/spark-1.2.1-bin-hadoop2.6/lib/spark-assembly-1.2.1-hadoop2.6.0.jar -> hdfs://hadoopdev01.opsdatastore.com:8020/user/root/.sparkStaging/application_1425078697953_0018/spark-assembly-1.2.1-hadoop2.6.0.jar
15/03/06 12:34:21 INFO yarn.Client: Setting up the launch environment for our AM container
15/03/06 12:34:21 INFO spark.SecurityManager: Changing view acls to: root
15/03/06 12:34:21 INFO spark.SecurityManager: Changing modify acls to: root
15/03/06 12:34:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/03/06 12:34:21 INFO yarn.Client: Submitting application 18 to ResourceManager
15/03/06 12:34:21 INFO impl.YarnClientImpl: Submitted application application_1425078697953_0018
15/03/06 12:34:22 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:22 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1425663261755
 final status: UNDEFINED
 tracking URL: http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018/
 user: root
15/03/06 12:34:23 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:24 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:25 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:26 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkyar...@hadoopdev08.opsdatastore.com:40201/user/YarnAM#-557112763]
15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> hadoopdev02.opsdatastore.com, PROXY_URI_BASES -> http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018), /proxy/application_1425078697953_0018
15/03/06 12:34:27 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/03/06 12:34:27 INFO yarn.Client: Application report for application_1425078697953_0018 (state: RUNNING)
15/03/06 12:34:27 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: hadoopdev08.opsdatastore.com
 ApplicationMaster RPC port: 0
 queue: default
 start time: 1425663261755
 final status: UNDEFINED
 tracking URL: http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018/
 user: root
15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: Application application_1425078697953_0018 has started running.
15/03/06 12:34:28 INFO netty.NettyBlockTransferService: Server created on 46124
15/03/06 12:34:28 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/03/06 12:34:28 INFO storage.BlockManagerMasterActor: Registering block manager hadoopdev01.opsdatastore.com:46124 with 265.4 MB RAM, BlockManagerId(, hadoopdev01.opsdatastore.com, 46124)
15/03/06 12:34:28 INFO storage.BlockManagerMaster: Registered BlockManager
15/03/06 1

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
Hi Todd,

Looks like the Thrift server can connect to the metastore, but something is wrong in
the executors. You can try to get the log with "yarn logs -applicationId xxx"
to check why it failed. If there is no log (the master or executor was not started
at all), you can go to the RM web page and click the link to see why the shell
failed in the first place.

Thanks.

Zhan Zhang

On Mar 6, 2015, at 9:59 AM, Todd Nist <tsind...@gmail.com> wrote:

First, thanks to everyone for their assistance and recommendations.

@Marcelo

I applied the patch that you recommended and am now able to get into the shell, 
thank you worked great after I realized that the pom was pointing to the 
1.3.0-SNAPSHOT for parent, need to bump that down to 1.2.1.

@Zhan

Need to apply this patch next.  I tried to start the spark-thriftserver but and 
it starts, then fails with like this:  I have the entries in my 
spark-default.conf, but not the patch applied.


./sbin/start-thriftserver.sh --master yarn --executor-memory 1024m --hiveconf 
hive.server2.thrift.port=10001

5/03/06 12:34:17 INFO ui.SparkUI: Started SparkUI at 
http://hadoopdev01.opsdatastore.com:4040
15/03/06 12:34:18 INFO impl.TimelineClientImpl: Timeline service address: 
http://hadoopdev02.opsdatastore.com:8188/ws/v1/timeline/
15/03/06 12:34:18 INFO client.RMProxy: Connecting to ResourceManager at 
hadoopdev02.opsdatastore.com/192.168.15.154:8050
15/03/06 12:34:18 INFO yarn.Client: Requesting a new application from cluster 
with 4 NodeManagers
15/03/06 12:34:18 INFO yarn.Client: Verifying our application has not requested 
more than the maximum memory capability of the cluster (8192 MB per container)
15/03/06 12:34:18 INFO yarn.Client: Will allocate AM container, with 896 MB 
memory including 384 MB overhead
15/03/06 12:34:18 INFO yarn.Client: Setting up container launch context for our 
AM
15/03/06 12:34:18 INFO yarn.Client: Preparing resources for our AM container
15/03/06 12:34:19 WARN shortcircuit.DomainSocketFactory: The short-circuit 
local reads feature cannot be used because libhadoop cannot be loaded.
15/03/06 12:34:19 INFO yarn.Client: Uploading resource 
file:/root/spark-1.2.1-bin-hadoop2.6/lib/spark-assembly-1.2.1-hadoop2.6.0.jar 
-> 
hdfs://hadoopdev01.opsdatastore.com:8020/user/root/.sparkStaging/application_1425078697953_0018/spark-assembly-1.2.1-hadoop2.6.0.jar
15/03/06 12:34:21 INFO yarn.Client: Setting up the launch environment for our 
AM container
15/03/06 12:34:21 INFO spark.SecurityManager: Changing view acls to: root
15/03/06 12:34:21 INFO spark.SecurityManager: Changing modify acls to: root
15/03/06 12:34:21 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(root); users with 
modify permissions: Set(root)
15/03/06 12:34:21 INFO yarn.Client: Submitting application 18 to ResourceManager
15/03/06 12:34:21 INFO impl.YarnClientImpl: Submitted application 
application_1425078697953_0018
15/03/06 12:34:22 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:22 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1425663261755
 final status: UNDEFINED
 tracking URL: 
http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018/
 user: root
15/03/06 12:34:23 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:24 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:25 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:26 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: ApplicationMaster 
registered as 
Actor[akka.tcp://sparkyar...@hadoopdev08.opsdatastore.com:40201/user/YarnAM#-557112763]
15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> 
hadoopdev02.opsdatastore.com, PROXY_URI_BASES -> 
http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018),
 /proxy/application_1425078697953_0018
15/03/06 12:34:27 INFO ui.JettyUtils: Adding filter: 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/03/06 12:34:27 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: RUNNING)
15/03/06 12:34:27 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: hadoopdev08.opsdatastore.com
 ApplicationMaster RPC port: 0
 queue: default
 start time: 1425663261755
 final status: UNDEFINED
 tracking URL: 
http://hadoopdev02.opsdatastore.com:80

Re: Visualize Spark Job

2015-03-06 Thread Phuoc Do
I have submitted this PR; you can merge it and try it out.

https://github.com/apache/spark/pull/2077

On Thu, Jan 15, 2015 at 12:50 AM, Kuromatsu, Nobuyuki <
n.kuroma...@jp.fujitsu.com> wrote:

> Hi
>
> I want to visualize tasks and stages in order to analyze spark jobs.
> I know necessary metrics is written in spark.eventLog.dir.
>
> Does anyone know the tool like swimlanes in Tez?
>
> Regards,
>
> Nobuyuki Kuromatsu
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Phuoc Do
https://vida.io/dnprock


SparkSQL supports hive "insert overwrite directory"?

2015-03-06 Thread ogoh
Hello,
I am using Spark 1.2.1 along with Hive 0.13.1.
I run some Hive queries using Beeline and the Thrift server.
The queries I have tested so far worked well except for the following:
I want to export the query output to a file on either HDFS or the local fs
(ideally the local fs).
Is this not yet supported?
The Spark GitHub repo already has unit tests using "insert overwrite directory"
in
https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/ql/src/test/queries/clientpositive/insert_overwrite_local_directory_1.q.

$insert overwrite directory '' select * from temptable; 
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
temptable
  TOK_INSERT
TOK_DESTINATION
  TOK_DIR
'/user/ogoh/table'
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF

scala.NotImplementedError: No parse rules for:
 TOK_DESTINATION
  TOK_DIR
'/user/bob/table'

$insert overwrite local directory '' select * from
temptable; ;
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
temptable
  TOK_INSERT
TOK_DESTINATION
  TOK_LOCAL_DIR
"/user/bob/table"
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF

scala.NotImplementedError: No parse rules for:
 TOK_DESTINATION
  TOK_LOCAL_DIR
"/user/ogoh/table"

Thanks,
Okehee



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-supports-hive-insert-overwrite-directory-tp21951.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



takeSample triggers 2 jobs

2015-03-06 Thread Rares Vernica
Hello,

I am using takeSample from the Scala Spark 1.2.1 shell:

scala> sc.textFile("README.md").takeSample(false, 3)


and I notice that two jobs are generated on the Spark Jobs page:

Job Id Description
1 takeSample at :13
0  takeSample at :13


Any ideas why the two jobs are needed?

Thanks!
Rares


Re: [SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-06 Thread Zhan Zhang
Do you mean “--hiveConf” (two dashes) instead of “-hiveconf” (one dash)?

Thanks.

Zhan Zhang

On Mar 6, 2015, at 4:20 AM, James  wrote:

> Hello, 
> 
> I want to execute a hql script through `spark-sql` command, my script 
> contains: 
> 
> ```
> ALTER TABLE xxx 
> DROP PARTITION (date_key = ${hiveconf:CUR_DATE});
> ```
> 
> when I execute 
> 
> ```
> spark-sql -f script.hql -hiveconf CUR_DATE=20150119
> ```
> 
> It throws an error like 
> ```
> cannot recognize input near '$' '{' 'hiveconf' in constant
> ```
> 
> I have try on hive and it works. Thus how could I pass a parameter like date 
> to a hql script?
> 
> Alcaid


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Help with transformWith in SparkStreaming

2015-03-06 Thread Laeeq Ahmed
Hi,
I am filtering the first DStream with the value in the second DStream. I also want to
keep the value of the second DStream. I have done the following and am having a problem
with returning the new RDD:

val transformedFileAndTime = fileAndTime.transformWith(anomaly,
  (rdd1: RDD[(String, String)], rdd2: RDD[Int]) => {
    var first = " "; var second = " "; var third = 0
    if (rdd2.first <= 3) {
      first = rdd1.map(_._1).first
      second = rdd1.map(_._2).first
      third = rdd2.first
    }
    RDD[(first, second, third)]
  })

ERROR
/home/hduser/Projects/scalaad/src/main/scala/eeg/anomd/StreamAnomalyDetector.scala:119: error: not found: value RDD
[ERROR]  RDD[(first,second,third)]

I have imported org.apache.spark.rdd.RDD.

Regards,
Laeeq
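
For illustration, a rough sketch (not from the original message) of one way to
make this transform compile: transformWith must return an actual RDD value, and
"RDD[(first, second, third)]" names the RDD type rather than building one. The
sketch assumes the same fileAndTime and anomaly DStreams and builds the result
with parallelize.

import org.apache.spark.rdd.RDD

val transformedFileAndTime = fileAndTime.transformWith(anomaly,
  (rdd1: RDD[(String, String)], rdd2: RDD[Int]) => {
    val level = rdd2.first()
    if (level <= 3) {
      val (first, second) = rdd1.first()
      // build a one-element RDD holding the extracted values
      rdd1.sparkContext.parallelize(Seq((first, second, level)))
    } else {
      rdd1.sparkContext.emptyRDD[(String, String, Int)]
    }
  })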


Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread sandeep vura
Hi,

For creating a Hive table, do I need to add hive-site.xml in the spark/conf
directory?

On Fri, Mar 6, 2015 at 11:12 PM, Michael Armbrust 
wrote:

> Its not required, but even if you don't have hive installed you probably
> still want to use the HiveContext.  From earlier in that doc:
>
> In addition to the basic SQLContext, you can also create a HiveContext,
>> which provides a superset of the functionality provided by the basic
>> SQLContext. Additional features include the ability to write queries using
>> the more complete HiveQL parser, access to HiveUDFs, and the ability to
>> read data from Hive tables. To use a HiveContext, *you do not need to
>> have an existing Hive setup*, and all of the data sources available to a
>> SQLContext are still available. HiveContext is only packaged separately to
>> avoid including all of Hive’s dependencies in the default Spark build. If
>> these dependencies are not a problem for your application then using
>> HiveContext is recommended for the 1.2 release of Spark. Future releases
>> will focus on bringing SQLContext up to feature parity with a HiveContext.
>
>
> On Fri, Mar 6, 2015 at 7:22 AM, Yin Huai  wrote:
>
>> Hi Edmon,
>>
>> No, you do not need to install Hive to use Spark SQL.
>>
>> Thanks,
>>
>> Yin
>>
>> On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli  wrote:
>>
>>>  Does Spark-SQL require installation of Hive for it to run correctly or
>>> not?
>>>
>>> I could not tell from this statement:
>>>
>>> https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive
>>>
>>> Thank you,
>>> Edmon
>>>
>>
>>
>


Re: takeSample triggers 2 jobs

2015-03-06 Thread Denny Lee
Hi Rares,

If you dig into the descriptions for the two jobs, it will probably return
something like:

Job ID: 1
org.apache.spark.rdd.RDD.takeSample(RDD.scala:447)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
...

Job ID: 0
org.apache.spark.rdd.RDD.takeSample(RDD.scala:428)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
...

The code for Spark from the git copy of master at:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

Basically, line 428 refers to
val initialCount = this.count()

And line 447 refers to
var samples = this.sample(withReplacement, fraction,
rand.nextInt()).collect()

Basically, the first job is getting the count so you can do the second job
which is to generate the samples.

HTH!
Denny
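
For what it's worth, if the extra count() job matters, one possible workaround (a
sketch, not from this thread) is to sample with a fraction you pick yourself and then
take the first n elements; you lose takeSample's exact-size guarantee but usually end
up with a single job:

    val rdd = sc.textFile("README.md")
    val n = 3
    // Assumption: 0.1 is only a guess; pick a fraction comfortably above n / total records.
    val roughSample = rdd.sample(withReplacement = false, fraction = 0.1, seed = 42L).take(n)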




On Fri, Mar 6, 2015 at 10:44 AM Rares Vernica  wrote:

> Hello,
>
> I am using takeSample from the Scala Spark 1.2.1 shell:
>
> scala> sc.textFile("README.md").takeSample(false, 3)
>
>
> and I notice that two jobs are generated on the Spark Jobs page:
>
> Job Id Description
> 1  takeSample at <console>:13
> 0  takeSample at <console>:13
>
>
> Any ideas why the two jobs are needed?
>
> Thanks!
> Rares
>


Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Todd Nist
Hi Zhan,

I applied the patch you recommended,
https://github.com/apache/spark/pull/3409, and it now works. It was failing
with this:

Exception message:
/hadoop/yarn/local/usercache/root/appcache/application_1425078697953_0020/container_1425078697953_0020_01_02/launch_container.sh:
line 14:
$PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/
*${hdp.version}*/hadoop/lib/hadoop-lzo-0.6.0.*${hdp.version}*.jar:/etc/hadoop/conf/secure:$PWD/__app__.jar:$PWD/*:
*bad substitution*

While the spark-default.conf has these defined:

spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041


without the patch, *${hdp.version}* was not being substituted.

Thanks for pointing me to that patch, appreciate it.

-Todd

On Fri, Mar 6, 2015 at 1:12 PM, Zhan Zhang  wrote:

>  Hi Todd,
>
>  Looks like the thrift server can connect to metastore, but something
> wrong in the executors. You can try to get the log with "yarn logs
> -applicationID xxx” to check why it failed. If there is no log (master or
> executor is not started at all), you can go to the RM webpage, click the
> link to see why the shell failed in the first place.
>
>  Thanks.
>
>  Zhan Zhang
>
>  On Mar 6, 2015, at 9:59 AM, Todd Nist  wrote:
>
>  First, thanks to everyone for their assistance and recommendations.
>
>  @Marcelo
>
>  I applied the patch that you recommended and am now able to get into the
> shell, thank you worked great after I realized that the pom was pointing to
> the 1.3.0-SNAPSHOT for parent, need to bump that down to 1.2.1.
>
>  @Zhan
>
>  Need to apply this patch next. I tried to start the spark-thriftserver; it
> starts, then fails like this (I have the entries in my spark-default.conf, but
> not the patch applied):
>
>   ./sbin/start-thriftserver.sh --master yarn --executor-memory 1024m 
> --hiveconf hive.server2.thrift.port=10001
>
> 15/03/06 12:34:17 INFO ui.SparkUI: Started SparkUI at http://hadoopdev01.opsdatastore.com:4040
> 15/03/06 12:34:18 INFO impl.TimelineClientImpl: Timeline service address: http://hadoopdev02.opsdatastore.com:8188/ws/v1/timeline/
> 15/03/06 12:34:18 INFO client.RMProxy: Connecting to ResourceManager at hadoopdev02.opsdatastore.com/192.168.15.154:8050
> 15/03/06 12:34:18 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
> 15/03/06 12:34:18 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
> 15/03/06 12:34:18 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
> 15/03/06 12:34:18 INFO yarn.Client: Setting up container launch context for our AM
> 15/03/06 12:34:18 INFO yarn.Client: Preparing resources for our AM container
> 15/03/06 12:34:19 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
> 15/03/06 12:34:19 INFO yarn.Client: Uploading resource file:/root/spark-1.2.1-bin-hadoop2.6/lib/spark-assembly-1.2.1-hadoop2.6.0.jar -> hdfs://hadoopdev01.opsdatastore.com:8020/user/root/.sparkStaging/application_1425078697953_0018/spark-assembly-1.2.1-hadoop2.6.0.jar
> 15/03/06 12:34:21 INFO yarn.Client: Setting up the launch environment for our AM container
> 15/03/06 12:34:21 INFO spark.SecurityManager: Changing view acls to: root
> 15/03/06 12:34:21 INFO spark.SecurityManager: Changing modify acls to: root
> 15/03/06 12:34:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
> 15/03/06 12:34:21 INFO yarn.Client: Submitting application 18 to ResourceManager
> 15/03/06 12:34:21 INFO impl.YarnClientImpl: Submitted application application_1425078697953_0018
> 15/03/06 12:34:22 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
> 15/03/06 12:34:22 INFO yarn.Client:
>  client token: N/A
>  diagnostics: N/A
>  ApplicationMaster host: N/A
>  ApplicationMaster RPC port: -1
>  queue: default
>  start time: 1425663261755
>  final status: UNDEFINED
>  tracking URL: http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018/
>  user: root
> 15/03/06 12:34:23 INFO yarn.Client:

Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Burak Yavuz
Hi Koert,

Would you like to register this on spark-packages.org?

Burak

On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers  wrote:

> currently spark provides many excellent algorithms for operations per key
> as long as the data send to the reducers per key fits in memory. operations
> like combineByKey, reduceByKey and foldByKey rely on pushing the operation
> map-side so that the data reduce-side is small. and groupByKey simply
> requires that the values per key fit in memory.
>
> but there are algorithms for which we would like to process all the values
> per key reduce-side, even when they do not fit in memory. examples are
> algorithms that need to process the values ordered, or algorithms that need
> to emit all values again. basically this is what the original hadoop reduce
> operation did so well: it allowed sorting of values (using secondary sort),
> and it processed all values per key in a streaming fashion.
>
> the library spark-sorted aims to bring these kind of operations back to
> spark, by providing a way to process values with a user provided
> Ordering[V] and a user provided streaming operation Iterator[V] =>
> Iterator[W]. it does not make the assumption that the values need to fit in
> memory per key.
>
> the basic idea is to rely on spark's sort-based shuffle to re-arrange the
> data so that all values for a given key are placed consecutively within a
> single partition, and then process them using a map-like operation.
>
> you can find the project here:
> https://github.com/tresata/spark-sorted
>
> the project is in a very early stage. any feedback is very much
> appreciated.
>
>
>
>
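
For readers who want the underlying pattern in plain Spark, here is a hedged sketch
(not the spark-sorted API itself) of the composite-key trick the library builds on,
using repartitionAndSortWithinPartitions (available in recent Spark versions, 1.2+ if
I recall correctly) so each key's values arrive consecutively and in order within a
partition:

    import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    object SecondarySortSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SecondarySortSketch"))
        val data = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2)))

        // Composite key (key, value), but partition only on the original key,
        // so all values of a key land in one partition, sorted by (key, value).
        val hash = new HashPartitioner(4)
        val byKey = new Partitioner {
          override def numPartitions: Int = hash.numPartitions
          override def getPartition(key: Any): Int = key match {
            case (k, _) => hash.getPartition(k)
            case other  => hash.getPartition(other)
          }
        }

        val arranged = data
          .map { case (k, v) => ((k, v), ()) }
          .repartitionAndSortWithinPartitions(byKey)
          .map { case ((k, v), _) => (k, v) }

        // Values for each key are now consecutive and ordered within their partition,
        // so they can be consumed in a streaming fashion, e.g. with mapPartitions.
        arranged.collect().foreach(println)
        sc.stop()
      }
    }

This only covers the re-arrangement step; wrapping the per-key streaming operation
(Iterator[V] => Iterator[W]) is the part the spark-sorted library provides on top.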


Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
Sorry. Misunderstanding. Looks like it already worked. If you still met some 
hdp.version problem, you can try it :)

Thanks.

Zhan Zhang

On Mar 6, 2015, at 11:40 AM, Zhan Zhang <zzh...@hortonworks.com> wrote:

You are using 1.2.1 right? If so, please add java-opts  in conf directory and 
give it a try.

[root@c6401 conf]# more java-opts
  -Dhdp.version=2.2.2.0-2041

Thanks.

Zhan Zhang

On Mar 6, 2015, at 11:35 AM, Todd Nist <tsind...@gmail.com> wrote:

 -Dhdp.version=2.2.0.0-2041




Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
You are using 1.2.1 right? If so, please add java-opts  in conf directory and 
give it a try.

[root@c6401 conf]# more java-opts
  -Dhdp.version=2.2.2.0-2041

Thanks.

Zhan Zhang

On Mar 6, 2015, at 11:35 AM, Todd Nist <tsind...@gmail.com> wrote:

 -Dhdp.version=2.2.0.0-2041



Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Todd Nist
Working great now, after applying that patch; thanks again.

On Fri, Mar 6, 2015 at 2:42 PM, Zhan Zhang  wrote:

>  Sorry. Misunderstanding. Looks like it already worked. If you still met
> some hdp.version problem, you can try it :)
>
>  Thanks.
>
>  Zhan Zhang
>
>  On Mar 6, 2015, at 11:40 AM, Zhan Zhang  wrote:
>
>  You are using 1.2.1 right? If so, please add java-opts  in conf
> directory and give it a try.
>
>  [root@c6401 conf]# more java-opts
>   -Dhdp.version=2.2.2.0-2041
>
>  Thanks.
>
>  Zhan Zhang
>
>  On Mar 6, 2015, at 11:35 AM, Todd Nist  wrote:
>
>  -Dhdp.version=2.2.0.0-2041
>
>
>
>


Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
Only if you want to configure the connection to an existing hive metastore.

On Fri, Mar 6, 2015 at 11:08 AM, sandeep vura  wrote:

> Hi ,
>
> For creating a Hive table do i need to add hive-site.xml in spark/conf
> directory.
>
> On Fri, Mar 6, 2015 at 11:12 PM, Michael Armbrust 
> wrote:
>
>> Its not required, but even if you don't have hive installed you probably
>> still want to use the HiveContext.  From earlier in that doc:
>>
>> In addition to the basic SQLContext, you can also create a HiveContext,
>>> which provides a superset of the functionality provided by the basic
>>> SQLContext. Additional features include the ability to write queries using
>>> the more complete HiveQL parser, access to HiveUDFs, and the ability to
>>> read data from Hive tables. To use a HiveContext, *you do not need to
>>> have an existing Hive setup*, and all of the data sources available to
>>> a SQLContext are still available. HiveContext is only packaged separately
>>> to avoid including all of Hive’s dependencies in the default Spark build.
>>> If these dependencies are not a problem for your application then using
>>> HiveContext is recommended for the 1.2 release of Spark. Future releases
>>> will focus on bringing SQLContext up to feature parity with a HiveContext.
>>
>>
>> On Fri, Mar 6, 2015 at 7:22 AM, Yin Huai  wrote:
>>
>>> Hi Edmon,
>>>
>>> No, you do not need to install Hive to use Spark SQL.
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>> On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli  wrote:
>>>
  Does Spark-SQL require installation of Hive for it to run correctly or
 not?

 I could not tell from this statement:

 https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

 Thank you,
 Edmon

>>>
>>>
>>
>


Re: Data Frame types

2015-03-06 Thread Michael Armbrust
No, the UDT API is not a public API as we have not stabilized the
implementation.  For this reason its only accessible to projects inside of
Spark.

On Fri, Mar 6, 2015 at 8:25 AM, Jaonary Rabarisoa  wrote:

> Hi Cesar,
>
> Yes, you can define an UDT with the new DataFrame, the same way that
> SchemaRDD did.
>
> Jaonary
>
> On Fri, Mar 6, 2015 at 4:22 PM, Cesar Flores  wrote:
>
>>
>> The SchemaRDD supports the storage of user defined classes. However, in
>> order to do that, the user class needs to extend the UserDefinedType interface
>> (see for example VectorUDT in org.apache.spark.mllib.linalg).
>>
>> My question is: will the new DataFrame structure (to be released in Spark
>> 1.3) be able to handle user-defined classes too? Will user classes need to
>> extend the same interface and follow the same approach?
>>
>>
>> --
>> Cesar Flores
>>
>
>


Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
On Fri, Mar 6, 2015 at 11:56 AM, sandeep vura  wrote:

> Yes, I want to link with an existing Hive metastore. Is that the right way
> to link to the Hive metastore?


Yes.


Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
On Fri, Mar 6, 2015 at 11:58 AM, sandeep vura  wrote:

> Can I get a document on how to create that setup? I mean, I need Hive
> integration on Spark.


http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
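
For completeness, a minimal sketch of using HiveContext without an existing Hive
installation, along the lines of the documentation quoted earlier in this discussion
(with no hive-site.xml, a local Derby metastore_db is created in the working
directory):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveContextSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HiveContextSketch"))
        val sqlContext = new HiveContext(sc)

        // HiveQL works even without a Hive install; this uses the local metastore.
        sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
        sqlContext.sql("SELECT COUNT(*) FROM src").collect().foreach(println)

        sc.stop()
      }
    }

To point at an existing metastore instead, drop a hive-site.xml into spark/conf as
discussed above.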


Re: Help with transformWith in SparkStreaming

2015-03-06 Thread Laeeq Ahmed
Yes, this is the problem. I want to return an RDD, but it is abstract and I
cannot instantiate it. So what are the other options? I have two streams and I
want to filter one stream on the basis of the other, and also keep the value of
the other stream. I have also tried join, but one stream has more values than
the other in each sliding window, and after the join I get repetitions which I
don't want.

Regards,
Laeeq
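
One possible alternative, sketched here under the assumptions that fileAndTime is a
DStream[(String, String)], anomaly is a DStream[Int], and rdd2 is non-empty in each
batch: instead of trying to construct an RDD by hand, transform rdd1 itself and tag
each element with the value read from rdd2.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    def tagWithAnomaly(fileAndTime: DStream[(String, String)],
                       anomaly: DStream[Int]): DStream[(String, String, Int)] =
      fileAndTime.transformWith(anomaly, (rdd1: RDD[(String, String)], rdd2: RDD[Int]) => {
        val level = rdd2.first()                                     // anomaly value for this batch
        if (level <= 3) {
          rdd1.map { case (file, time) => (file, time, level) }      // keep and tag
        } else {
          rdd1.sparkContext.emptyRDD[(String, String, Int)]          // drop the whole batch
        }
      })

Because the returned value is derived from rdd1 via map (or emptyRDD), no RDD ever has
to be instantiated directly.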

 On Friday, March 6, 2015 8:11 PM, Sean Owen  wrote:
   

 What is this line supposed to mean?

RDD[(first,second,third)]

It's not valid as a line of code, and you don't instantiate RDDs anyway.

On Fri, Mar 6, 2015 at 7:06 PM, Laeeq Ahmed
 wrote:
> Hi,
>
> I am filtering first DStream with the value in second DStream. I also want
> to keep the value of second Dstream. I have done the following and having
> problem with returning new RDD:
>
> val transformedFileAndTime = fileAndTime.transformWith(anomaly, (rdd1:
> RDD[(String,String)], rdd2 : RDD[Int]) => {
>                                                                  var first
> = " "; var second = " "; var third = 0
>                                                                  if
> (rdd2.first<=3)
>
> {
>
> first = rdd1.map(_._1).first
>
> second = rdd1.map(_._2).first
>
> third = rdd2.first
>
> }
>
> RDD[(first,second,third)]
>                                                                })
>
> ERROR
> /home/hduser/Projects/scalaad/src/main/scala/eeg/anomd/StreamAnomalyDetector.scala:119:
> error: not found: value RDD
> [ERROR] RDD[(first,second,third)]
>
> I am imported the import org.apache.spark.rdd.RDD
>
>
> Regards,
> Laeeq
>


   

Re: Spark Streaming Switchover Time

2015-03-06 Thread Tathagata Das
It is probably the time taken by the system to figure out that the worker
is down. Could you see in the logs to find what goes on when you kill the
worker?

TD

On Wed, Mar 4, 2015 at 6:20 AM, Nastooh Avessta (navesta)  wrote:

>  Indeed. And am wondering if this switchover time can be decreased.
>
> Cheers,
>
>
>
>
> *Nastooh Avessta*
> ENGINEER.SOFTWARE ENGINEERING
> nave...@cisco.com
> Phone: *+1 604 647 1527 <%2B1%20604%20647%201527>*
>
> *Cisco Systems Limited*
> 595 Burrard Street, Suite 2123 Three Bentall Centre, PO Box 49121
> VANCOUVER
> BRITISH COLUMBIA
> V7X 1J1
> CA
> Cisco.com 
>
>
>
>
> This email may contain confidential and privileged material for the sole
> use of the intended recipient. Any review, use, distribution or disclosure
> by others is strictly prohibited. If you are not the intended recipient (or
> authorized to receive for the recipient), please contact the sender by
> reply email and delete all copies of this message.
>
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/index.html
>
> Cisco Systems Canada Co, 181 Bay St., Suite 3400, Toronto, ON, Canada, M5J
> 2T3. Phone: 416-306-7000; Fax: 416-306-7099. *Preferences
>  - Unsubscribe
>  – Privacy
> *
>
>
>
> *From:* Tathagata Das [mailto:t...@databricks.com]
> *Sent:* Tuesday, March 03, 2015 11:11 PM
>
> *To:* Nastooh Avessta (navesta)
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark Streaming Switchover Time
>
>
>
> I am confused. Are you killing the 1st worker node to see whether the
> system restarts the receiver on the second worker?
>
>
>
> TD
>
>
>
> On Tue, Mar 3, 2015 at 10:49 PM, Nastooh Avessta (navesta) <
> nave...@cisco.com> wrote:
>
> This is the time that it takes for the driver to start receiving data once
> again, from the 2nd worker, when the 1st worker, where streaming thread
> was initially running, is shutdown.
>
> Cheers,
>
>
>
>
>
>
> *From:* Tathagata Das [mailto:t...@databricks.com]
> *Sent:* Tuesday, March 03, 2015 10:24 PM
> *To:* Nastooh Avessta (navesta)
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark Streaming Switchover Time
>
>
>
> Can you elaborate on what is this switchover time?
>
>
>
> TD
>
>
>
> On Tue, Mar 3, 2015 at 9:57 PM, Nastooh Avessta (navesta) <
> nave...@cisco.com> wrote:
>
> Hi
>
> On a standalone, Spark 1.0.0, with 1 master and 2 workers, operating in
> client mode, running a udp streaming application, I am noting around 2
> second elapse time on switchover, upon shutting down the streaming worker,
> where streaming window length is 1 sec. I am wondering what parameters are
> available to the developer to shorten this switchover time.
>
> Cheers,
>
>
>

HiveContext test, "Spark Context did not initialize after waiting 10000ms"

2015-03-06 Thread nitinkak001
I am trying to run a Hive query from Spark using HiveContext. Here is the
code

val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")

conf.set("spark.executor.extraClassPath",
  "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib");
conf.set("spark.driver.extraClassPath",
  "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib");
conf.set("spark.yarn.am.waitTime", "30L")

val sc = new SparkContext(conf)

val sqlContext = new HiveContext(sc)

def inputRDD = sqlContext.sql("describe spark_poc.src_digital_profile_user");

inputRDD.collect().foreach { println }

println(inputRDD.schema.getClass.getName)

Getting this exception. Any clues? The weird part is if I try to do the same
thing but in Java instead of Scala, it runs fine.

Exception in thread "Driver" java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
15/03/06 17:39:32 ERROR yarn.ApplicationMaster: SparkContext did not
initialize after waiting for 1 ms. Please check earlier log output for
errors. Failing the application.
Exception in thread "main" java.lang.NullPointerException
at
org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized(ApplicationMaster.scala:218)
at
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:110)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:434)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:433)
at
org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
15/03/06 17:39:32 INFO yarn.ApplicationMaster: AppMaster received a signal.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/HiveContext-test-Spark-Context-did-not-initialize-after-waiting-1ms-tp21953.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: HiveContext test, "Spark Context did not initialize after waiting 10000ms"

2015-03-06 Thread Marcelo Vanzin
On Fri, Mar 6, 2015 at 2:47 PM, nitinkak001  wrote:
> I am trying to run a Hive query from Spark using HiveContext. Here is the
> code
>
> / val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")
>
>
> conf.set("spark.executor.extraClassPath",
> "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib");
> conf.set("spark.driver.extraClassPath",
> "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib");
> conf.set("spark.yarn.am.waitTime", "30L")

You're missing "/*" at the end of your classpath entries. Also, since
you're on CDH 5.2, you'll probably need to filter out the guava jar
from Hive's lib directory, otherwise things might break. So things
will get a little more complicated.

With CDH 5.3 you shouldn't need to filter out the guava jar.

-- 
Marcelo
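
Concretely, the change Marcelo describes would look something like this (a sketch
using the paths from the original post):

    import org.apache.spark.SparkConf

    val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")
    // Append /* so the jars inside Hive's lib directory actually land on the classpath.
    conf.set("spark.executor.extraClassPath",
      "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib/*")
    conf.set("spark.driver.extraClassPath",
      "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib/*")

(The guava filtering on CDH 5.2 that Marcelo mentions is not shown here.)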




Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Koert Kuipers
i added it

On Fri, Mar 6, 2015 at 2:40 PM, Burak Yavuz  wrote:

> Hi Koert,
>
> Would you like to register this on spark-packages.org?
>
> Burak
>
> On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers  wrote:
>
>> currently spark provides many excellent algorithms for operations per key
>> as long as the data send to the reducers per key fits in memory. operations
>> like combineByKey, reduceByKey and foldByKey rely on pushing the operation
>> map-side so that the data reduce-side is small. and groupByKey simply
>> requires that the values per key fit in memory.
>>
>> but there are algorithms for which we would like to process all the
>> values per key reduce-side, even when they do not fit in memory. examples
>> are algorithms that need to process the values ordered, or algorithms that
>> need to emit all values again. basically this is what the original hadoop
>> reduce operation did so well: it allowed sorting of values (using secondary
>> sort), and it processed all values per key in a streaming fashion.
>>
>> the library spark-sorted aims to bring these kind of operations back to
>> spark, by providing a way to process values with a user provided
>> Ordering[V] and a user provided streaming operation Iterator[V] =>
>> Iterator[W]. it does not make the assumption that the values need to fit in
>> memory per key.
>>
>> the basic idea is to rely on spark's sort-based shuffle to re-arrange the
>> data so that all values for a given key are placed consecutively within a
>> single partition, and then process them using a map-like operation.
>>
>> you can find the project here:
>> https://github.com/tresata/spark-sorted
>>
>> the project is in a very early stage. any feedback is very much
>> appreciated.
>>
>>
>>
>>
>


Spark streaming and executor object reusage

2015-03-06 Thread Jean-Pascal Billaud
Hi,

Reading through the Spark Streaming Programming Guide, I read in the
"Design Patterns for using foreachRDD":

"Finally, this can be further optimized by reusing connection objects
across multiple RDDs/batches.
One can maintain a static pool of connection objects that can be reused as
RDDs of multiple batches are pushed to the external system"

I have this connection pool that might be more or less heavy to
instantiate. I don't use it as part of a foreachRDD but as part of regular
map operations to query some api service. I'd like to understand what
"multiple batches" means here. Is this across RDDs on a single DStream?
Across multiple DStreams?

I'd like to understand the extent of context shareability across DStreams over
time. Is it expected that the executor initializing my factory will keep
getting batches from my streaming job while using the same singleton
connection pool over and over? Or does Spark reset executor state after each
DStream is completed, potentially allocating executors to other streaming
jobs?

Thanks,
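
A minimal sketch of the "static pool" idea from the guide, with ApiConnection standing
in (hypothetically) for whatever client the map operations call: a Scala object is
initialized at most once per executor JVM, and since executors normally stay allocated
to a single application for its lifetime (barring failures or dynamic allocation), the
same instance is reused across batches of that job; it is not shared with other
streaming applications.

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical stand-in for an API client that is expensive to create.
    class ApiConnection {
      def send(record: String): Unit = println(s"sent: $record")
    }

    // One instance per executor JVM: the lazy val is created the first time a task
    // on that executor touches it, then reused by later tasks and later batches.
    object ConnectionHolder {
      lazy val connection = new ApiConnection
    }

    def pushToApi(lines: DStream[String]): Unit =
      lines.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          val conn = ConnectionHolder.connection  // no per-batch re-creation
          records.foreach(conn.send)
        }
      }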