Reg: Reading a csv file with String label into LabeledPoint

2016-03-15 Thread Dharmin Siddesh J
Hi

I am trying to read a CSV with a few double attributes and a String label. How
can I convert it to a LabeledPoint RDD so that I can run it with Spark MLlib
classification algorithms?

I have tried
the LabeledPoint constructor (it lives in the regression package), but it
accepts only a double-format label. Is there any other way to handle the
string label and convert it into an RDD?

Regards
Siddesh
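
A minimal sketch of one common approach, for reference: MLlib classifiers need
numeric labels, so the string labels can be indexed to doubles before building
LabeledPoint (which lives in org.apache.spark.mllib.regression, hence the
confusion above). The file name and column layout below are assumptions;
spark.ml's StringIndexer does the equivalent for DataFrames.

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // Assumption: "data.csv" has no header, double features first, the string label last.
  val raw = sc.textFile("data.csv").map(_.split(","))

  // Map each distinct string label to a double (0.0, 1.0, ...).
  val labelIndex = raw.map(_.last).distinct().collect().zipWithIndex
    .map { case (label, idx) => (label, idx.toDouble) }.toMap

  val labeledPoints = raw.map { fields =>
    LabeledPoint(labelIndex(fields.last), Vectors.dense(fields.init.map(_.toDouble)))
  }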


Re: How to add an accumulator for a Set in Spark

2016-03-15 Thread pppsunil
Have you looked at using the Accumulable interface? Take a look at the Spark
documentation at
http://spark.apache.org/docs/latest/programming-guide.html#accumulators; it
gives an example of how to use a vector type for an accumulator, which might be
very close to what you need.
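
A minimal sketch of the Accumulable approach mentioned above, accumulating
distinct strings into an immutable Set (the element type and names are
illustrative only):

  import org.apache.spark.AccumulableParam

  object SetParam extends AccumulableParam[Set[String], String] {
    def addAccumulator(acc: Set[String], elem: String): Set[String] = acc + elem
    def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2
    def zero(initial: Set[String]): Set[String] = Set.empty[String]
  }

  val seen = sc.accumulable(Set.empty[String])(SetParam)
  sc.parallelize(Seq("a", "b", "a")).foreach(x => seen += x)
  println(seen.value)   // e.g. Set(a, b)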



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-an-accumulator-for-a-Set-in-Spark-tp26510p26514.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




convert row to map of key as int and values as arrays

2016-03-15 Thread Divya Gehlot
Hi,
As I can't add columns from another DataFrame,
I am planning to convert my row columns to a map of key and arrays.
As I am new to Scala and Spark,
I am trying like below:

// create an empty map
import scala.collection.mutable.{ArrayBuffer => mArrayBuffer}
var map = Map[Int,mArrayBuffer[Any]]()


def addNode(key: String, value:ArrayBuffer[Any] ) ={
nodes += (key -> (value :: (nodes get key getOrElse Nil)))
 }

  var rows = dfLnItmMappng.collect()
rows.foreach(r =>  addNode(r.getInt(2),
(r.getString(1),r.getString(3),r.getString(4),r.getString(5
for ((k,v) <- rows)
printf("key: %s, value: %s\n", k, v)

But I am getting the error below:
import scala.collection.mutable.{ArrayBuffer=>mArrayBuffer}
map:
scala.collection.immutable.Map[Int,scala.collection.mutable.ArrayBuffer[Any]]
= Map()
:28: error: not found: value nodes
nodes += (key -> (value :: (nodes get key getOrElse Nil)))
^
:27: error: not found: type ArrayBuffer
   def addNode(key: String, value:ArrayBuffer[Any] ) ={



If anybody knows a better method to add columns from another
DataFrame, please let me know.


Thanks,
Divya
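
The errors above come from the snippet itself: the map is created under the
name map while addNode refers to nodes, and ArrayBuffer was only imported under
the alias mArrayBuffer. A corrected sketch of what the code appears to be
attempting (column positions are taken from the original, and dfLnItmMappng is
assumed to exist as above):

  import scala.collection.mutable
  import scala.collection.mutable.ArrayBuffer

  // One mutable map, keyed by the Int in column 2, collecting the other columns.
  val nodes = mutable.Map[Int, ArrayBuffer[Any]]()

  def addNode(key: Int, values: Seq[Any]): Unit =
    nodes.getOrElseUpdate(key, ArrayBuffer[Any]()) ++= values

  val rows = dfLnItmMappng.collect()
  rows.foreach { r =>
    addNode(r.getInt(2), Seq(r.getString(1), r.getString(3), r.getString(4), r.getString(5)))
  }

  for ((k, v) <- nodes) printf("key: %s, value: %s\n", k, v)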


Get Pair of Topic and Message from Kafka + Spark Streaming

2016-03-15 Thread Imre Nagi
Hi,

I'm just trying to process the data that come from the kafka source in my
spark streaming application. What I want to do is get the pair of topic and
message in a tuple from the message stream.

Here is my stream:

 val streams = KafkaUtils.createDirectStream[String, Array[Byte],
   StringDecoder, DefaultDecoder](ssc, kafkaParameter, Set("topic1", "topic2"))


I have done several things, but still failed when i did some
transformations from the streams to the pair of topic and message. I hope
somebody can help me here.

Thanks,
Imre
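
One way to tag each record with its topic is to use the direct stream's offset
ranges: partition i of each direct-stream RDD corresponds to offsetRanges(i).
A rough sketch against the streams DStream above (values are still Array[Byte]
here):

  import org.apache.spark.streaming.kafka.HasOffsetRanges

  val topicAndMessage = streams.transform { rdd =>
    // Read the offset ranges off the direct stream's RDD before any other transformation.
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.mapPartitionsWithIndex { (i, records) =>
      records.map { case (_, value) => (offsetRanges(i).topic, value) }
    }
  }

  topicAndMessage.print()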


Re: Spark UI Completed Jobs

2016-03-15 Thread Prabhu Joseph
Thanks Mark and Jeff

On Wed, Mar 16, 2016 at 7:11 AM, Mark Hamstra 
wrote:

> Looks to me like the one remaining Stage would execute 19788 Tasks if all
> of those Tasks succeeded on the first try; but because of retries, 19841
> Tasks were actually executed.  Meanwhile, there were 41405 Tasks in the
> 163 Stages that were skipped.
>
> I think -- but the Spark UI's accounting may not be 100% accurate and bug
> free.
>
> On Tue, Mar 15, 2016 at 6:34 PM, Prabhu Joseph  > wrote:
>
>> Okay, so out of 164 stages, 163 are skipped. And how are 41405 tasks
>> skipped if the total is only 19788?
>>
>> On Wed, Mar 16, 2016 at 6:31 AM, Mark Hamstra 
>> wrote:
>>
>>> It's not just if the RDD is explicitly cached, but also if the map
>>> outputs for stages have been materialized into shuffle files and are still
>>> accessible through the map output tracker.  Because of that, explicitly
>>> caching RDD actions often gains you little or nothing, since even without a
>>> call to cache() or persist() the prior computation will largely be reused
>>> and stages will show up as skipped -- i.e. no need to recompute that stage.
>>>
>>> On Tue, Mar 15, 2016 at 5:50 PM, Jeff Zhang  wrote:
>>>
 If RDD is cached, this RDD is only computed once and the stages for
 computing this RDD in the following jobs are skipped.


 On Wed, Mar 16, 2016 at 8:14 AM, Prabhu Joseph <
 prabhujose.ga...@gmail.com> wrote:

> Hi All,
>
>
> Spark UI Completed Jobs section shows below information, what is the
> skipped value shown for Stages and Tasks below.
>
> Job_IDDescriptionSubmitted
> Duration   Stages (Succeeded/Total)Tasks (for all stages):
> Succeeded/Total
>
> 11 count  2016/03/14 15:35:32  1.4
> min 164/164 * (163 skipped)   *19841/19788
> *(41405 skipped)*
> Thanks,
> Prabhu Joseph
>



 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>
>


Re: Job failed while submitting python to yarn programmatically

2016-03-15 Thread sychungd
Hi Jeff,
Sorry, I forgot to mention that
the same Java code works fine if we replace the python pi.py file with the
jar version of the pi example.



On 2016/03/16 at 11:05 AM, Jeff Zhang wrote (to sychu...@tsmc.com, cc: user):

Could you try yarn-cluster mode ? Make sure your cluster nodes can reach
your client machine and no firewall.

On Wed, Mar 16, 2016 at 10:54 AM,  wrote:

  Hi all,

  We're trying to submit a python file, pi.py in this case, to YARN from
  Java code, but this kept failing (Spark 1.6.0).
  It seems the AM uses the arguments we passed to pi.py as the driver IP
  address.
  Could someone help me figure out how to get the job done? Thanks in
  advance.

  The java code looks like below:

            String[] args = new String[]{
                  "--name",
                  "Test Submit Python To Yarn From Java",
                  "--primary-py-file",
                  SPARK_HOME + "/examples/src/main/python/pi.py",
                  "--num-executors",
                  "5",
                  "--driver-memory",
                  "512m",
                  "--executor-memory",
                  "512m",
                  "--executor-cores",
                  "1",
                  "--arg",
                  args[0]
              };

              Configuration config = new Configuration();
              SparkConf sparkConf = new SparkConf();
              ClientArguments clientArgs = new ClientArguments(args,
  sparkConf
  );
              Client client = new Client(clientArgs, config, sparkConf);
              client.run();


  The jar is submitted by spark-submit::
  ./bin/spark-submit --class SubmitPyYARNJobFromJava --master yarn-client
  TestSubmitPythonFromJava.jar 10


  The job submitted to YARN just stays in ACCEPTED before it fails.
  What I can't figure out is that the YARN log shows the AM couldn't reach the
  driver at
  10:0, which is my argument passed to pi.py

  SLF4J: Class path contains multiple SLF4J bindings.
  SLF4J: Found binding in
  
[jar:file:/data/1/yarn/local/usercache/root/filecache/2084/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]


  SLF4J: Found binding in
  
[jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]


  SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
  explanation.
  SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
  16/03/15 17:54:44 INFO yarn.ApplicationMaster: Registered signal handlers
  for [TERM, HUP, INT]
  16/03/15 17:54:45 INFO yarn.ApplicationMaster: ApplicationAttemptId:
  appattempt_1458023046377_0499_01
  16/03/15 17:54:45 INFO spark.SecurityManager: Changing view acls to:
  yarn,root
  16/03/15 17:54:45 INFO spark.SecurityManager: Changing modify acls to:
  yarn,root
  16/03/15 17:54:45 INFO spark.SecurityManager: SecurityManager:
  authentication disabled; ui acls disabled; users with 

Re: Job failed while submitting python to yarn programmatically

2016-03-15 Thread Saisai Shao
You cannot directly invoke a Spark application by using yarn#client as you
mentioned; it is deprecated and not supported. You have to use spark-submit
to submit a Spark application to YARN.

Also, the specific problem here is that you're invoking yarn#client to run the
Spark app in yarn-client mode (the default), in which the AM expects that the
driver is already started, but here it apparently is not, so the AM throws this
exception.

Anyway, this way of submitting a Spark application is not supported for now;
please refer to the docs for spark-submit.

Thanks
Saisai
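
For reference, one supported programmatic route (not mentioned in this thread)
is to drive spark-submit through org.apache.spark.launcher.SparkLauncher,
available since Spark 1.4. A sketch only; the paths and settings mirror the
snippet in the original mail and are assumptions:

  import org.apache.spark.launcher.SparkLauncher

  // Runs bin/spark-submit under the hood, so the normal YARN submission path applies.
  val sparkHome = sys.env("SPARK_HOME")
  val process = new SparkLauncher()
    .setSparkHome(sparkHome)
    .setAppResource(sparkHome + "/examples/src/main/python/pi.py")
    .setMaster("yarn-cluster")
    .setConf("spark.executor.instances", "5")
    .setConf("spark.driver.memory", "512m")
    .setConf("spark.executor.memory", "512m")
    .setConf("spark.executor.cores", "1")
    .addAppArgs("10")
    .launch()

  process.waitFor()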

On Wed, Mar 16, 2016 at 11:05 AM, Jeff Zhang  wrote:

> Could you try yarn-cluster mode ? Make sure your cluster nodes can reach
> your client machine and no firewall.
>
> On Wed, Mar 16, 2016 at 10:54 AM,  wrote:
>
>>
>> Hi all,
>>
>> We're trying to submit a python file, pi.py in this case, to yarn from
>> java
>> code but this kept failing(1.6.0).
>> It seems the AM uses the arguments we passed to pi.py as the driver IP
>> address.
>> Could someone help me figuring out how to get the job done. Thanks in
>> advance.
>>
>> The java code looks like below:
>>
>>   String[] args = new String[]{
>> "--name",
>> "Test Submit Python To Yarn From Java",
>> "--primary-py-file",
>> SPARK_HOME + "/examples/src/main/python/pi.py",
>> "--num-executors",
>> "5",
>> "--driver-memory",
>> "512m",
>> "--executor-memory",
>> "512m",
>> "--executor-cores",
>> "1",
>> "--arg",
>> args[0]
>> };
>>
>> Configuration config = new Configuration();
>> SparkConf sparkConf = new SparkConf();
>> ClientArguments clientArgs = new ClientArguments(args,
>> sparkConf
>> );
>> Client client = new Client(clientArgs, config, sparkConf);
>> client.run();
>>
>>
>> The jar is submitted by spark-submit::
>> ./bin/spark-submit --class SubmitPyYARNJobFromJava --master yarn-client
>> TestSubmitPythonFromJava.jar 10
>>
>>
>> The job submit to yarn just stay in ACCEPTED before it failed
>> What I can't figure out is, yarn log shows AM couldn't reach the driver at
>> 10:0, which is my argument passed to pi.py
>>
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in
>>
>> [jar:file:/data/1/yarn/local/usercache/root/filecache/2084/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>
>> SLF4J: Found binding in
>>
>> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>> explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> 16/03/15 17:54:44 INFO yarn.ApplicationMaster: Registered signal handlers
>> for [TERM, HUP, INT]
>> 16/03/15 17:54:45 INFO yarn.ApplicationMaster: ApplicationAttemptId:
>> appattempt_1458023046377_0499_01
>> 16/03/15 17:54:45 INFO spark.SecurityManager: Changing view acls to:
>> yarn,root
>> 16/03/15 17:54:45 INFO spark.SecurityManager: Changing modify acls to:
>> yarn,root
>> 16/03/15 17:54:45 INFO spark.SecurityManager: SecurityManager:
>> authentication disabled; ui acls disabled; users with view permissions:
>> Set
>> (yarn, root); users with modify permissions: Set(yarn, root)
>> 16/03/15 17:54:45 INFO yarn.ApplicationMaster: Waiting for Spark driver to
>> be reachable.
>> 16/03/15 17:54:45 ERROR yarn.ApplicationMaster: Failed to connect to
>> driver
>> at 10:0, retrying ...
>> 16/03/15 17:54:46 ERROR yarn.ApplicationMaster: Failed to connect to
>> driver
>> at 10:0, retrying ...
>> 16/03/15 17:54:46 ERROR yarn.ApplicationMaster: Failed to connect to
>> driver
>> at 10:0, retrying ...
>> .
>> 16/03/15 17:56:25 ERROR yarn.ApplicationMaster: Failed to connect to
>> driver
>> at 10:0, retrying ...
>> 16/03/15 17:56:26 ERROR yarn.ApplicationMaster: Uncaught exception:
>> org.apache.spark.SparkException: Failed to connect to driver!
>>  at
>> org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver
>> (ApplicationMaster.scala:484)
>>  at
>> org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher
>> (ApplicationMaster.scala:345)
>>  at org.apache.spark.deploy.yarn.ApplicationMaster.run
>> (ApplicationMaster.scala:187)
>>  at
>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun
>> $main$1.apply$mcV$sp(ApplicationMaster.scala:653)
>>  at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run
>> (SparkHadoopUtil.scala:69)
>>  at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run
>> (SparkHadoopUtil.scala:68)
>>  at java.security.AccessController.doPrivileged(Native
>> Method)
>>

Re: Job failed while submitting python to yarn programmatically

2016-03-15 Thread Jeff Zhang
Could you try yarn-cluster mode ? Make sure your cluster nodes can reach
your client machine and no firewall.

On Wed, Mar 16, 2016 at 10:54 AM,  wrote:

>
> Hi all,
>
> We're trying to submit a python file, pi.py in this case, to yarn from java
> code but this kept failing(1.6.0).
> It seems the AM uses the arguments we passed to pi.py as the driver IP
> address.
> Could someone help me figuring out how to get the job done. Thanks in
> advance.
>
> The java code looks like below:
>
>   String[] args = new String[]{
> "--name",
> "Test Submit Python To Yarn From Java",
> "--primary-py-file",
> SPARK_HOME + "/examples/src/main/python/pi.py",
> "--num-executors",
> "5",
> "--driver-memory",
> "512m",
> "--executor-memory",
> "512m",
> "--executor-cores",
> "1",
> "--arg",
> args[0]
> };
>
> Configuration config = new Configuration();
> SparkConf sparkConf = new SparkConf();
> ClientArguments clientArgs = new ClientArguments(args,
> sparkConf
> );
> Client client = new Client(clientArgs, config, sparkConf);
> client.run();
>
>
> The jar is submitted by spark-submit::
> ./bin/spark-submit --class SubmitPyYARNJobFromJava --master yarn-client
> TestSubmitPythonFromJava.jar 10
>
>
> The job submit to yarn just stay in ACCEPTED before it failed
> What I can't figure out is, yarn log shows AM couldn't reach the driver at
> 10:0, which is my argument passed to pi.py
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
>
> [jar:file:/data/1/yarn/local/usercache/root/filecache/2084/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>
> SLF4J: Found binding in
>
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/03/15 17:54:44 INFO yarn.ApplicationMaster: Registered signal handlers
> for [TERM, HUP, INT]
> 16/03/15 17:54:45 INFO yarn.ApplicationMaster: ApplicationAttemptId:
> appattempt_1458023046377_0499_01
> 16/03/15 17:54:45 INFO spark.SecurityManager: Changing view acls to:
> yarn,root
> 16/03/15 17:54:45 INFO spark.SecurityManager: Changing modify acls to:
> yarn,root
> 16/03/15 17:54:45 INFO spark.SecurityManager: SecurityManager:
> authentication disabled; ui acls disabled; users with view permissions: Set
> (yarn, root); users with modify permissions: Set(yarn, root)
> 16/03/15 17:54:45 INFO yarn.ApplicationMaster: Waiting for Spark driver to
> be reachable.
> 16/03/15 17:54:45 ERROR yarn.ApplicationMaster: Failed to connect to driver
> at 10:0, retrying ...
> 16/03/15 17:54:46 ERROR yarn.ApplicationMaster: Failed to connect to driver
> at 10:0, retrying ...
> 16/03/15 17:54:46 ERROR yarn.ApplicationMaster: Failed to connect to driver
> at 10:0, retrying ...
> .
> 16/03/15 17:56:25 ERROR yarn.ApplicationMaster: Failed to connect to driver
> at 10:0, retrying ...
> 16/03/15 17:56:26 ERROR yarn.ApplicationMaster: Uncaught exception:
> org.apache.spark.SparkException: Failed to connect to driver!
>  at
> org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver
> (ApplicationMaster.scala:484)
>  at
> org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher
> (ApplicationMaster.scala:345)
>  at org.apache.spark.deploy.yarn.ApplicationMaster.run
> (ApplicationMaster.scala:187)
>  at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun
> $main$1.apply$mcV$sp(ApplicationMaster.scala:653)
>  at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run
> (SparkHadoopUtil.scala:69)
>  at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run
> (SparkHadoopUtil.scala:68)
>  at java.security.AccessController.doPrivileged(Native
> Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at org.apache.hadoop.security.UserGroupInformation.doAs
> (UserGroupInformation.java:1628)
>  at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser
> (SparkHadoopUtil.scala:68)
>  at org.apache.spark.deploy.yarn.ApplicationMaster$.main
> (ApplicationMaster.scala:651)
>  at org.apache.spark.deploy.yarn.ExecutorLauncher$.main
> (ApplicationMaster.scala:674)
>  at org.apache.spark.deploy.yarn.ExecutorLauncher.main
> (ApplicationMaster.scala)
> 16/03/15 17:56:26 INFO yarn.ApplicationMaster: Final app status: FAILED,
> exitCode: 10, (reason: Uncaught exception: org.apache.spark.SparkException:
> Failed to connect to 

Fwd: Connection failure followed by bad shuffle files during shuffle

2016-03-15 Thread Eric Martin
Hi,

I'm running into consistent failures during a shuffle read while trying to
do a group-by followed by a count aggregation (using the DataFrame API on
Spark 1.5.2).

The shuffle read (in stage 1) fails with

org.apache.spark.shuffle.FetchFailedException: Failed to send RPC
7719188499899260109 to host_a/ip_a:35946:
java.nio.channels.ClosedChannelException
at
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:321)


Looking into executor logs shows first shows

ERROR TransportChannelHandler: Connection to host_b/ip_b:38804 has been
quiet for 12 ms while there are outstanding requests. Assuming
connection is dead; please adjust spark.network.timeout if this is wrong.

on the node that threw the FetchFailedException (host_a) and

ERROR TransportRequestHandler: Error sending result
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=207789700738,
chunkIndex=894},
buffer=FileSegmentManagedBuffer{file=/local_disk/spark-ed6667d4-445b-4d65-bfda-e4540b7215aa/executor-d03e5e7e-57d4-40e2-9021-c20d0b84bf75/blockmgr-05d5f2b6-142e-415c-a08b-58d16a10b8bf/27/shuffle_1_13732_0.data,
offset=18960736, length=19477}} to /ip_a:32991; closing connection

on the node referenced in the exception (host_b). The error in the host_b
logs occurred a few seconds after the error in the host_a logs. I noticed
there was a lot of spilling going on during the shuffle read, so I
attempted to work around this problem by increasing the number of shuffle
partitions (to decrease spilling) as well as increasing
spark.network.timeout. Neither of these got rid of these connection
failures.

This causes some of stage 0 to recompute (which runs successfully). Stage 1
retry 1 then always fails with

java.io.IOException: FAILED_TO_UNCOMPRESS(5)
at
org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)

Changing the spark.io.compression.codec to lz4 changes this error to

java.io.IOException: Stream is corrupted
at
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:153)

which leads me to believe that the timeout during the shuffle read failure
leaves invalid files on disk.

Notably, these failures do not occur when I run on smaller subsets of data.
The failure is occurring while attempting to group ~100 billion rows into
20 billion groups (with key size of 24 bytes and count as the only
aggregation) on a 16 node cluster. I've replicated this failure on 2
completely separate clusters (both running with standalone cluster manager).

Does anyone have suggestions about how I could make this crash go away or
how I could try to make a smaller failing test case so the bug can be more
easily investigated?

Best,
Eric Martin


Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
It's the same as the Hive thrift server. I believe Kerberos is supported.

On Wed, Mar 16, 2016 at 10:48 AM, ayan guha  wrote:

> so, how about implementing security? Any pointer will be helpful
>
> On Wed, Mar 16, 2016 at 1:44 PM, Jeff Zhang  wrote:
>
>> The spark thrift server allow you to run hive query in spark engine. It
>> can be used as jdbc server.
>>
>> On Wed, Mar 16, 2016 at 10:42 AM, ayan guha  wrote:
>>
>>> Sorry to be dumb-head today, but what is the purpose of spark
>>> thriftserver then? In other words, should I view spark thriftserver as a
>>> better version of hive one (with Spark as execution engine instead of
>>> MR/Tez) OR should I see it as a JDBC server?
>>>
>>> On Wed, Mar 16, 2016 at 11:44 AM, Jeff Zhang  wrote:
>>>
 spark thrift server is very similar with hive thrift server. You can
 use hive jdbc driver to access spark thrift server. AFAIK, all the features
 of hive thrift server are also available in spark thrift server.

 On Wed, Mar 16, 2016 at 8:39 AM, ayan guha  wrote:

> Hi All
>
> My understanding about thriftserver is to use it to expose pre-loaded
> RDD/dataframes to tools who can connect through JDBC.
>
> Is there something like Spark JDBC server too? Does it do the same
> thing? What is the difference between these two?
>
> How does Spark JDBC/Thrift supports security? Can we restrict certain
> users to access certain dataframes and not the others?
>
> --
> Best Regards,
> Ayan Guha
>



 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Best Regards

Jeff Zhang
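
For reference, a minimal sketch of connecting to the Spark thrift server over
JDBC with the Hive driver, as described above. The host, database, table name
and credentials are assumptions; 10000 is the usual default listening port:

  import java.sql.DriverManager

  // Requires the Hive JDBC driver (hive-jdbc and its dependencies) on the classpath.
  Class.forName("org.apache.hive.jdbc.HiveDriver")

  val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
  val stmt = conn.createStatement()
  val rs   = stmt.executeQuery("SELECT count(*) FROM some_table")
  while (rs.next()) println(rs.getLong(1))
  rs.close(); stmt.close(); conn.close()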


Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
The Spark thrift server allows you to run Hive queries on the Spark engine. It
can be used as a JDBC server.

On Wed, Mar 16, 2016 at 10:42 AM, ayan guha  wrote:

> Sorry to be dumb-head today, but what is the purpose of spark thriftserver
> then? In other words, should I view spark thriftserver as a better version
> of hive one (with Spark as execution engine instead of MR/Tez) OR should I
> see it as a JDBC server?
>
> On Wed, Mar 16, 2016 at 11:44 AM, Jeff Zhang  wrote:
>
>> spark thrift server is very similar with hive thrift server. You can use
>> hive jdbc driver to access spark thrift server. AFAIK, all the features of
>> hive thrift server are also available in spark thrift server.
>>
>> On Wed, Mar 16, 2016 at 8:39 AM, ayan guha  wrote:
>>
>>> Hi All
>>>
>>> My understanding about thriftserver is to use it to expose pre-loaded
>>> RDD/dataframes to tools who can connect through JDBC.
>>>
>>> Is there something like Spark JDBC server too? Does it do the same
>>> thing? What is the difference between these two?
>>>
>>> How does Spark JDBC/Thrift supports security? Can we restrict certain
>>> users to access certain dataframes and not the others?
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Best Regards

Jeff Zhang


Does parallelize and collect preserve the original order of list?

2016-03-15 Thread JoneZhang
Step1
List<String> items = new ArrayList<String>();
items.addAll(XXX);
javaSparkContext.parallelize(items).saveAsTextFile(output);
Step2
final List<String> items2 = ctx.textFile(output).collect();

Do items and items2 have the same order?


Best wishes.
Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-parallelize-and-collect-preserve-the-original-order-of-list-tp26512.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."

2016-03-15 Thread craigiggy
I am having trouble with my standalone Spark cluster and I can't seem to find
a solution anywhere. I hope that maybe someone can figure out what is going
wrong so this issue might be resolved and I can continue with my work.

I am currently attempting to use Python and the pyspark library to do
distributed computing. I have two virtual machines set up for this cluster,
one machine is being used as both the master and one of the slaves
(*spark-mastr-1* with ip address: *xx.xx.xx.248*) and the other machine is
being used as just a slave (*spark-wrkr-1* with ip address: *xx.xx.xx.247*).
Both of them have 8GB of memory, 2 virtual sockets with 2 cores per socket
(4 CPU cores per machine for a total of 8 cores in the cluster). Both of
them have passwordless SSHing set up to each other (the master has
passwordless SSHing set up for itself as well since it is also being used as
one of the slaves).

At first I thought that *247* was just unable to connect to *248*, but I ran
a simple test with the Spark shell in order to check that the slaves are
able to talk with the master and they seem to be able to talk to each other
just fine. However, when I attempt to run my pyspark application, I still
run into trouble with *247* connecting with *248*. Then I thought it was a
memory issue, so I allocated 6GB of memory to each machine to use in Spark,
but this did not resolve the issue. Finally, I tried to give the pyspark
application more time before it times out as well as more retry attempts,
but I still get the same error. The error code that stands out to me is:

*org.apache.spark.shuffle.FetchFailedException: Failed to connect to
spark-mastr-1:xx*


The following is the error that I receive on my most recent attempted run of
the application:

Traceback (most recent call last):
  File "/home/spark/enigma_analytics/rec_engine/submission.py", line 413, in

main(sc,sw,sw_set)
  File "/home/spark/enigma_analytics/rec_engine/submission.py", line 391, in
main
   
run_engine(submission_type,inputSub,mdb_collection,mdb_collectionType,sw_set,sc,weighted=False)
  File "/home/spark/enigma_analytics/rec_engine/submission.py", line 332, in
run_engine
similarities_recRDD,recommendations =
recommend(subRDD,mdb_collection,query_format,sw_set,sc)
  File "/home/spark/enigma_analytics/rec_engine/submission.py", line 204, in
recommend
idfsCorpusWeightsBroadcast = core.idfsRDD(corpus,sc)
  File "/home/spark/enigma_analytics/rec_engine/core.py", line 38, in
idfsRDD
idfsInputRDD = ta.inverseDocumentFrequency(corpusRDD)
  File "/home/spark/enigma_analytics/rec_engine/textAnalyzer.py", line 106,
in inverseDocumentFrequency
N = corpus.map(lambda doc: doc[0]).distinct().count()
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/rdd.py",
line 1004, in count
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/rdd.py",
line 995, in sum
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/rdd.py",
line 869, in fold
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/rdd.py",
line 771, in collect
  File
"/usr/local/spark/current/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 813, in __call__
  File
"/usr/local/spark/current/python/lib/py4j-0.9-src.zip/py4j/protocol.py",
line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure:
ResultStage 5 (count at
/home/spark/enigma_analytics/rec_engine/textAnalyzer.py:106) has failed the
maximum allowable number of times: 4. Most recent failure reason:
*/org.apache.spark.shuffle.FetchFailedException: Failed to connect to
spark-mastr-1:44642/*
at
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
at
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
at
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
at

Re: S3 Zip File Loading Advice

2016-03-15 Thread Benjamin Kim
Hi Xinh,

I tried to wrap it, but it still didn’t work. I got a 
"java.util.ConcurrentModificationException”.

All,

I have been trying and trying with some help of a coworker, but it’s slow 
going. I have been able to gather a list of the s3 files I need to download.

### S3 Lists ###
import scala.collection.JavaConverters._
import java.util.ArrayList
import java.util.zip.{ZipEntry, ZipInputStream}
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ObjectListing, S3ObjectSummary, 
ListObjectsRequest, GetObjectRequest}
import org.apache.commons.io.IOUtils
import org.joda.time.{DateTime, Period}
import org.joda.time.format.DateTimeFormat

val s3Bucket = "amg-events"

val formatter = DateTimeFormat.forPattern("/MM/dd/HH")
var then = DateTime.now()

var files = new ArrayList[String]

//S3 Client and List Object Request
val s3Client = new AmazonS3Client()
val listObjectsRequest = new ListObjectsRequest()
var objectListing: ObjectListing = null

//Your S3 Bucket
listObjectsRequest.setBucketName(s3Bucket)

var now = DateTime.now()
var range = 
Iterator.iterate(now.minusDays(1))(_.plus(Period.hours(1))).takeWhile(!_.isAfter(now))
range.foreach(ymdh => {
  //Your Folder path or Prefix
  listObjectsRequest.setPrefix(formatter.print(ymdh))

  //Adding s3:// to the paths and adding to a list
  do {
objectListing = s3Client.listObjects(listObjectsRequest);
for (objectSummary <- objectListing.getObjectSummaries().asScala) {
  if (objectSummary.getKey().contains(".csv.zip") && 
objectSummary.getLastModified().after(then.toDate())) {
//files.add(objectSummary.getKey())
files.add("s3n://" + s3Bucket + "/" + objectSummary.getKey())
  }
}
listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())
})
then = now

//Creating a Scala List for same
val fileList = files.asScala

//Parallelize the Scala List
val fileRDD = sc.parallelize(fileList)

Now, I am trying to go through the list and download each file, unzip each file 
as it comes, and pass the ZipInputStream to the CSV parser. This is where I get 
stuck.

var df: DataFrame = null
for (file <- fileList) {
  val zipfile = s3Client.getObject(new GetObjectRequest(s3Bucket, file)).getObjectContent()
  val zis = new ZipInputStream(zipfile)
  var ze = zis.getNextEntry()
//  val fileDf = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load(zis)
//  if (df != null) {
//    df = df.unionAll(fileDf)
//  } else {
//    df = fileDf
//  }
}

I don’t know if I am doing it right or not. I also read that parallelizing 
fileList would allow parallel file retrieval. But, I don’t know how to proceed 
from here.

If you can help, I would be grateful.

Thanks,
Ben
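
For what it's worth, a sketch of the binaryFiles pattern from the Stack Overflow
link quoted below, assuming (as described above) that each archive holds exactly
one CSV entry and that the S3 credentials are already set on
sc.hadoopConfiguration:

  import java.util.zip.ZipInputStream
  import scala.io.Source

  // fileList is the s3n:// key list built above; binaryFiles accepts a comma-separated path list.
  val csvLines = sc.binaryFiles(fileList.mkString(",")).flatMap { case (_, content) =>
    val zis = new ZipInputStream(content.open())
    zis.getNextEntry()                       // position on the single CSV inside the zip
    Source.fromInputStream(zis).getLines()   // reads just that entry
  }

  // csvLines is an RDD[String] of raw CSV rows; it can be split on commas or
  // written back out and read through spark-csv.
  csvLines.take(5).foreach(println)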


> On Mar 9, 2016, at 10:10 AM, Xinh Huynh  wrote:
> 
> Could you wrap the ZipInputStream in a List, since a subtype of 
> TraversableOnce[?] is required?
> 
> case (name, content) => List(new ZipInputStream(content.open))
> 
> Xinh
> 
> On Wed, Mar 9, 2016 at 7:07 AM, Benjamin Kim  > wrote:
> Hi Sabarish,
> 
> I found a similar posting online where I should use the S3 listKeys. 
> http://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd
>  
> .
>  Is this what you were thinking?
> 
> And, your assumption is correct. The zipped CSV file contains only a single 
> file. I found this posting. 
> http://stackoverflow.com/questions/28969757/zip-support-in-apache-spark 
> . I 
> see how to do the unzipping, but I cannot get it to work when running the 
> code directly.
> 
> ...
> import java.io .{ IOException, FileOutputStream, 
> FileInputStream, File }
> import java.util.zip.{ ZipEntry, ZipInputStream }
> import org.apache.spark.input.PortableDataStream
> 
> sc.hadoopConfiguration.set("fs.s3n.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKey)
> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretKey)
> 
> val zipFile = 
> "s3n://events/2016/03/01/00/event-20160301.00-4877ff81-928f-4da4-89b6-6d40a28d61c7.csv.zip
>  <>"
> val zipFileRDD = sc.binaryFiles(zipFile).flatMap { case (name: String, 
> content: PortableDataStream) => new ZipInputStream(content.open) }
> 
> :95: error: type mismatch;
>  found   : java.util.zip.ZipInputStream
>  required: TraversableOnce[?]
>  val zipFileRDD = sc.binaryFiles(zipFile).flatMap { case (name, 
> content) => new ZipInputStream(content.open) }
>   
>   ^
> 
> Thanks,
> Ben
> 
>> On Mar 9, 2016, at 12:03 AM, Sabarish 

Re: Spark UI Completed Jobs

2016-03-15 Thread Prabhu Joseph
Okay, so out of 164 stages, 163 are skipped. And how are 41405 tasks
skipped if the total is only 19788?

On Wed, Mar 16, 2016 at 6:31 AM, Mark Hamstra 
wrote:

> It's not just if the RDD is explicitly cached, but also if the map outputs
> for stages have been materialized into shuffle files and are still
> accessible through the map output tracker.  Because of that, explicitly
> caching RDD actions often gains you little or nothing, since even without a
> call to cache() or persist() the prior computation will largely be reused
> and stages will show up as skipped -- i.e. no need to recompute that stage.
>
> On Tue, Mar 15, 2016 at 5:50 PM, Jeff Zhang  wrote:
>
>> If RDD is cached, this RDD is only computed once and the stages for
>> computing this RDD in the following jobs are skipped.
>>
>>
>> On Wed, Mar 16, 2016 at 8:14 AM, Prabhu Joseph <
>> prabhujose.ga...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>>
>>> Spark UI Completed Jobs section shows below information, what is the
>>> skipped value shown for Stages and Tasks below.
>>>
>>> Job_IDDescriptionSubmittedDuration
>>> Stages (Succeeded/Total)Tasks (for all stages): Succeeded/Total
>>>
>>> 11 count  2016/03/14 15:35:32  1.4
>>> min 164/164 * (163 skipped)   *19841/19788
>>> *(41405 skipped)*
>>> Thanks,
>>> Prabhu Joseph
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


RE: sparkR issues ?

2016-03-15 Thread Sun, Rui
I have submitted https://issues.apache.org/jira/browse/SPARK-13905 and a PR for 
it.

From: Alex Kozlov [mailto:ale...@gmail.com]
Sent: Wednesday, March 16, 2016 12:52 AM
To: roni 
Cc: Sun, Rui ; user@spark.apache.org
Subject: Re: sparkR issues ?

Hi Roni, you can probably rename the as.data.frame in 
$SPARK_HOME/R/pkg/R/DataFrame.R and re-install SparkR by running install-dev.sh

On Tue, Mar 15, 2016 at 8:46 AM, roni 
> wrote:
Hi ,
 Is there a work around for this?
 Do i need to file a bug for this?
Thanks
-R

On Tue, Mar 15, 2016 at 12:28 AM, Sun, Rui 
> wrote:
It seems as.data.frame() defined in SparkR covers the version in the R base
package.
We can try to see if we can change the implementation of as.data.frame() in 
SparkR to avoid such covering.

From: Alex Kozlov [mailto:ale...@gmail.com]
Sent: Tuesday, March 15, 2016 2:59 PM
To: roni >
Cc: user@spark.apache.org
Subject: Re: sparkR issues ?

This seems to be a very unfortunate name collision.  SparkR defines its own 
DataFrame class which shadows what seems to be your own definition.

Is DataFrame something you define?  Can you rename it?

On Mon, Mar 14, 2016 at 10:44 PM, roni 
> wrote:
Hi,
 I am working with bioinformatics and trying to convert some scripts to sparkR 
to fit into other spark jobs.

I tried a simple example from a bioinformatics library, and as soon as I start
the SparkR environment it does not work.

code as follows -
countData <- matrix(1:100,ncol=4)
condition <- factor(c("A","A","B","B"))
dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)

It works if I don't initialize the SparkR environment.
If I do library(SparkR) and sqlContext <- sparkRSQL.init(sc), it gives the
following error:

> dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~ 
> condition)
Error in DataFrame(colData, row.names = rownames(colData)) :
  cannot coerce class "data.frame" to a DataFrame

I am really stumped. I am not using any Spark function, so I would expect it
to work as simple R code.
Why does it not work?

Appreciate  the help
-R




--
Alex Kozlov
(408) 507-4987
(650) 887-2135 efax
ale...@gmail.com




--
Alex Kozlov
(408) 507-4987
(650) 887-2135 efax
ale...@gmail.com


Re: Spark UI Completed Jobs

2016-03-15 Thread Mark Hamstra
It's not just if the RDD is explicitly cached, but also if the map outputs
for stages have been materialized into shuffle files and are still
accessible through the map output tracker.  Because of that, explicitly
caching RDD actions often gains you little or nothing, since even without a
call to cache() or persist() the prior computation will largely be reused
and stages will show up as skipped -- i.e. no need to recompute that stage.
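
A small sketch of the effect being described, with no caching anywhere (the
input path is a placeholder):

  // The second action reuses the map-side shuffle files written by the first,
  // so its shuffle map stage shows up in the UI as "skipped" even though the
  // RDD was never cached or persisted.
  val counts = sc.textFile("hdfs:///some/input")
    .map(word => (word, 1L))
    .reduceByKey(_ + _)

  counts.count()   // job 1: both stages run
  counts.count()   // job 2: the map stage is skipped, only the result stage runs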

On Tue, Mar 15, 2016 at 5:50 PM, Jeff Zhang  wrote:

> If RDD is cached, this RDD is only computed once and the stages for
> computing this RDD in the following jobs are skipped.
>
>
> On Wed, Mar 16, 2016 at 8:14 AM, Prabhu Joseph  > wrote:
>
>> Hi All,
>>
>>
>> Spark UI Completed Jobs section shows below information, what is the
>> skipped value shown for Stages and Tasks below.
>>
>> Job_IDDescriptionSubmittedDuration
>> Stages (Succeeded/Total)Tasks (for all stages): Succeeded/Total
>>
>> 11 count  2016/03/14 15:35:32  1.4
>> min 164/164 * (163 skipped)   *19841/19788
>> *(41405 skipped)*
>> Thanks,
>> Prabhu Joseph
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: How to add an accumulator for a Set in Spark

2016-03-15 Thread Ted Yu
Please take a look at:
core/src/test/scala/org/apache/spark/AccumulatorSuite.scala

FYI

On Tue, Mar 15, 2016 at 4:29 PM, SRK  wrote:

> Hi,
>
> How do I add an accumulator for a Set in Spark?
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-an-accumulator-for-a-Set-in-Spark-tp26510.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Streaming app consume multiple kafka topics

2016-03-15 Thread Imre Nagi
Hi Cody,

Can you give a small example of how to use mapPartitions with a switch on topic?
I've tried, but it still didn't work.
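
For reference, a rough sketch of what that suggestion could look like for a
single direct stream covering both topics; the topic names and the per-topic
handlers below are placeholders:

  import org.apache.spark.streaming.kafka.HasOffsetRanges

  stream.foreachRDD { rdd =>
    // Partition i of a direct-stream RDD corresponds to offsetRanges(i).
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    val handled = rdd.mapPartitionsWithIndex { (i, records) =>
      offsetRanges(i).topic match {
        case "topic1" => records.map(r => processTopic1(r._2))   // placeholder handlers
        case "topic2" => records.map(r => processTopic2(r._2))
        case _        => Iterator.empty
      }
    }
    handled.count()   // or any other action to trigger the work
  }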

On Tue, Mar 15, 2016 at 9:45 PM, Cody Koeninger  wrote:

> The direct stream gives you access to the topic.  The offset range for
> each partition contains the topic.  That way you can create a single
> stream, and the first thing you do with it is mapPartitions with a
> switch on topic.
>
> Of course, it may make more sense to separate topics into different
> jobs, but if you want it all in one, that's the most straightforward
> way to do it imho.
>
> On Tue, Mar 15, 2016 at 1:55 AM, saurabh guru 
> wrote:
> > I am doing the same thing this way:
> >
> > // Iterate over HashSet of topics
> > Iterator iterator = topicsSet.iterator();
> > JavaPairInputDStream messages;
> > JavaDStream lines;
> > String topic = "";
> > // get messages stream for each topic
> > while (iterator.hasNext()) {
> > topic = iterator.next();
> > // Create direct kafka stream with brokers and topic
> > messages = KafkaUtils.createDirectStream(jssc, String.class,
> > String.class, StringDecoder.class, StringDecoder.class, kafkaParams,
> > new HashSet(Arrays.asList(topic)));
> >
> > // get lines from messages.map
> > lines = messages.map(new Function,
> > String>() {
> > @Override
> > public String call(Tuple2 tuple2) {
> > return tuple2._2();
> > }
> > });
> >
> >
> > switch (topic) {
> > case IMPR_ACC:
> > ImprLogProc.groupAndCount(lines, esImpIndexName,
> IMPR_ACC,
> > new ImprMarshal());
> >
> > break;
> > case EVENTS_ACC:
> > EventLogProc.groupAndCount(lines, esEventIndexName,
> > EVENTS_ACC, new EventMarshal());
> > break;
> >
> > default:
> > logger.error("No matching Kafka topics Found");
> > break;
> > }
> >
> > On Tue, Mar 15, 2016 at 12:22 PM, Akhil Das 
> > wrote:
> >>
> >> One way would be to keep it this way:
> >>
> >> val stream1 = KafkaUtils.createStream(..) // for topic 1
> >>
> >> val stream2 = KafkaUtils.createStream(..) // for topic 2
> >>
> >>
> >> And you will know which stream belongs to which topic.
> >>
> >> Another approach which you can put in your code itself would be to tag
> the
> >> topic name along with the stream that you are creating. Like, create a
> >> tuple(topic, stream) and you will be able to access ._1 as topic and
> ._2 as
> >> the stream.
> >>
> >>
> >> Thanks
> >> Best Regards
> >>
> >> On Tue, Mar 15, 2016 at 12:05 PM, Imre Nagi 
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I'm just trying to create a spark streaming application that consumes
> >>> more than one topics sent by kafka. Then, I want to do different
> further
> >>> processing for data sent by each topic.
> >>>
>  val kafkaStreams = {
>    val kafkaParameter = for (consumerGroup <- consumerGroups)
> yield {
>  Map(
>    "metadata.broker.list" -> ConsumerConfig.metadataBrokerList,
>    "zookeeper.connect" -> ConsumerConfig.zookeeperConnect,
>    "group.id" -> consumerGroup,
>    "zookeeper.connection.timeout.ms" ->
>  ConsumerConfig.zookeeperConnectionTimeout,
>    "schema.registry.url" -> ConsumerConfig.schemaRegistryUrl,
>    "auto.offset.reset" -> ConsumerConfig.autoOffsetReset
>  )
>    }
>    val streams = (0 to kafkaParameter.length - 1) map { p =>
>  KafkaUtils.createStream[String, Array[Byte], StringDecoder,
>  DefaultDecoder](
>    ssc,
>    kafkaParameter(p),
>    Map(topicsArr(p) -> 1),
>    StorageLevel.MEMORY_ONLY_SER
>  ).map(_._2)
>    }
>    val unifiedStream = ssc.union(streams)
>    unifiedStream.repartition(1)
>  }
>  kafkaStreams.foreachRDD(rdd => {
>    rdd.foreachPartition(partitionOfRecords => {
>  partitionOfRecords.foreach ( x =>
>    println(x)
>  )
>    })
>  })
> >>>
> >>>
> >>> So far, I'm able to get the data from several topic. However, I'm still
> >>> unable to
> >>> differentiate the data sent from a topic with another.
> >>>
> >>> Do anybody has an experience in doing this stuff?
> >>>
> >>> Best,
> >>> Imre
> >>
> >>
> >
> >
> >
> > --
> > Thanks,
> > Saurabh
> >
> > :)
>


Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
The Spark thrift server is very similar to the Hive thrift server. You can use
the Hive JDBC driver to access the Spark thrift server. AFAIK, all the features
of the Hive thrift server are also available in the Spark thrift server.

On Wed, Mar 16, 2016 at 8:39 AM, ayan guha  wrote:

> Hi All
>
> My understanding about thriftserver is to use it to expose pre-loaded
> RDD/dataframes to tools who can connect through JDBC.
>
> Is there something like Spark JDBC server too? Does it do the same thing?
> What is the difference between these two?
>
> How does Spark JDBC/Thrift supports security? Can we restrict certain
> users to access certain dataframes and not the others?
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Best Regards

Jeff Zhang


Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Jeff Zhang
Right, it is a little confusing here. dropTempTable actually means
unregister. It only deletes the metadata of this table from the catalog,
but you can still operate on the table through its DataFrame.

On Wed, Mar 16, 2016 at 8:27 AM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Thanks Jeff
>
> I was looking for something like ‘unregister’
>
>
> In SQL you use drop to delete a table. I always thought register was a
> strange function name.
>
> Register **-1 = unregister
> createTable **-1 == dropTable
>
> Andy
>
> From: Jeff Zhang 
> Date: Tuesday, March 15, 2016 at 4:44 PM
> To: Andrew Davidson 
> Cc: "user @spark" 
> Subject: Re: what is the pyspark inverse of registerTempTable()?
>
> >>> sqlContext.registerDataFrameAsTable(df, "table1")
> >>> sqlContext.dropTempTable("table1")
>
>
>
> On Wed, Mar 16, 2016 at 7:40 AM, Andy Davidson <
> a...@santacruzintegration.com> wrote:
>
>> Thanks
>>
>> Andy
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>


-- 
Best Regards

Jeff Zhang


Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
that should read anything.sbt

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 16 March 2016 at 00:04, Mich Talebzadeh 
wrote:

> in mvn the build mvn package will look for a file called pom.xml
>
> in sbt the build sbt package will look for a file called anything.smt
>
> It works
>
> Keep it simple
>
> I will write a ksh script that will create both generic and sbt files on
> the  fly in the correct directory (at the top of the tree) and remove them
> after the job finished.
>
> That will keep audit people happy as well 
>
> HTh
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 March 2016 at 23:34, Jakob Odersky  wrote:
>
>> The artifactId in maven basically (in a simple case) corresponds to name
>> in sbt.
>>
>> Note however that you will manually need to append the
>> _scalaBinaryVersion to the artifactId in case you would like to build
>> against multiple scala versions (otherwise maven will overwrite the
>> generated jar with the latest one).
>>
>>
>> On Tue, Mar 15, 2016 at 4:27 PM, Mich Talebzadeh
>>  wrote:
>> > ok  Ted
>> >
>> > In sbt I have
>> >
>> > name := "ImportCSV"
>> > version := "1.0"
>> > scalaVersion := "2.10.4"
>> >
>> > which ends up in importcsv_2.10-1.0.jar as part of
>> > target/scala-2.10/importcsv_2.10-1.0.jar
>> >
>> > In mvn I have
>> >
>> > <version>1.0</version>
>> > <artifactId>scala</artifactId>
>> >
>> >
>> > Does it matter?
>> >
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> >
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> >
>> >
>> > On 15 March 2016 at 23:17, Ted Yu  wrote:
>> >>
>> >> <version>1.0</version>
>> >> ...
>> >> <artifactId>scala</artifactId>
>> >>
>> >> On Tue, Mar 15, 2016 at 4:14 PM, Mich Talebzadeh
>> >>  wrote:
>> >>>
>> >>> An observation
>> >>>
>> >>> Once compiled with MVN the job submit works as follows:
>> >>>
>> >>> + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages
>> >>> com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master
>> >>> spark://50.140.197.217:7077 --executor-memory=12G --executor-cores=12
>> >>> --num-executors=2 target/scala-1.0.jar
>> >>>
>> >>> With sbt it takes this form
>> >>>
>> >>> + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages
>> >>> com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master
>> >>> spark://50.140.197.217:7077 --executor-memory=12G --executor-cores=12
>> >>> --num-executors=2 target/scala-2.10/importcsv_2.10-1.0.jar
>> >>>
>> >>> They both return the same results. However, why mvnjar file name is
>> >>> different (may be a naive question!)?
>> >>>
>> >>> thanks
>> >>>
>> >>>
>> >>> Dr Mich Talebzadeh
>> >>>
>> >>>
>> >>>
>> >>> LinkedIn
>> >>>
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >>>
>> >>>
>> >>>
>> >>> http://talebzadehmich.wordpress.com
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On 15 March 2016 at 22:43, Mich Talebzadeh > >
>> >>> wrote:
>> 
>>  Many thanks Ted and thanks for heads up Jakob
>> 
>>  Just these two changes to dependencies
>> 
>>  
 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.10</artifactId>
   <version>1.5.1</version>
 </dependency>
 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-sql_2.10</artifactId>
   <version>1.5.1</version>
 </dependency>
>> 
>> 
>>  [DEBUG] endProcessChildren: artifact=spark:scala:jar:1.0
>>  [INFO]
>> 
>> 
>>  [INFO] BUILD SUCCESS
>>  [INFO]
>> 
>> 
>>  [INFO] Total time: 01:04 min
>>  [INFO] Finished at: 2016-03-15T22:55:08+00:00
>>  [INFO] Final Memory: 32M/1089M
>>  [INFO]
>> 
>> 
>> 
>>  Dr Mich Talebzadeh
>> 
>> 
>> 
>>  LinkedIn
>> 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> 
>> 
>> 
>>  http://talebzadehmich.wordpress.com
>> 
>> 
>> 
>> 
>>  On 15 March 2016 at 22:18, Jakob Odersky  wrote:
>> >
>> > Hi Mich,
>> > probably unrelated to the current error you're seeing, however the
>> > following dependencies will bite you later:
>> > spark-hive_2.10
>> > spark-csv_2.11
>> > the problem here is that you're using libraries built for different
>> > Scala binary versions (the numbers after the underscore). The 

Spark UI Completed Jobs

2016-03-15 Thread Prabhu Joseph
Hi All,


Spark UI Completed Jobs section shows below information, what is the
skipped value shown for Stages and Tasks below.

Job_IDDescriptionSubmittedDuration
Stages (Succeeded/Total)Tasks (for all stages): Succeeded/Total

11 count  2016/03/14 15:35:32  1.4 min
164/164 * (163 skipped)   *19841/19788
*(41405 skipped)*
Thanks,
Prabhu Joseph


Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
in mvn the build mvn package will look for a file called pom.xml

in sbt the build sbt package will look for a file called anything.smt

It works

Keep it simple

I will write a ksh script that will create both generic and sbt files on
the  fly in the correct directory (at the top of the tree) and remove them
after the job finished.

That will keep audit people happy as well 

HTh



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 March 2016 at 23:34, Jakob Odersky  wrote:

> The artifactId in maven basically (in a simple case) corresponds to name
> in sbt.
>
> Note however that you will manually need to append the
> _scalaBinaryVersion to the artifactId in case you would like to build
> against multiple scala versions (otherwise maven will overwrite the
> generated jar with the latest one).
>
>
> On Tue, Mar 15, 2016 at 4:27 PM, Mich Talebzadeh
>  wrote:
> > ok  Ted
> >
> > In sbt I have
> >
> > name := "ImportCSV"
> > version := "1.0"
> > scalaVersion := "2.10.4"
> >
> > which ends up in importcsv_2.10-1.0.jar as part of
> > target/scala-2.10/importcsv_2.10-1.0.jar
> >
> > In mvn I have
> >
> > <version>1.0</version>
> > <artifactId>scala</artifactId>
> >
> >
> > Does it matter?
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> >
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> >
> >
> > On 15 March 2016 at 23:17, Ted Yu  wrote:
> >>
> >> <version>1.0</version>
> >> ...
> >> <artifactId>scala</artifactId>
> >>
> >> On Tue, Mar 15, 2016 at 4:14 PM, Mich Talebzadeh
> >>  wrote:
> >>>
> >>> An observation
> >>>
> >>> Once compiled with MVN the job submit works as follows:
> >>>
> >>> + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages
> >>> com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master
> >>> spark://50.140.197.217:7077 --executor-memory=12G --executor-cores=12
> >>> --num-executors=2 target/scala-1.0.jar
> >>>
> >>> With sbt it takes this form
> >>>
> >>> + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages
> >>> com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master
> >>> spark://50.140.197.217:7077 --executor-memory=12G --executor-cores=12
> >>> --num-executors=2 target/scala-2.10/importcsv_2.10-1.0.jar
> >>>
> >>> They both return the same results. However, why mvnjar file name is
> >>> different (may be a naive question!)?
> >>>
> >>> thanks
> >>>
> >>>
> >>> Dr Mich Talebzadeh
> >>>
> >>>
> >>>
> >>> LinkedIn
> >>>
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>
> >>>
> >>>
> >>> http://talebzadehmich.wordpress.com
> >>>
> >>>
> >>>
> >>>
> >>> On 15 March 2016 at 22:43, Mich Talebzadeh 
> >>> wrote:
> 
>  Many thanks Ted and thanks for heads up Jakob
> 
>  Just these two changes to dependencies
> 
>  <dependency>
>    <groupId>org.apache.spark</groupId>
>    <artifactId>spark-core_2.10</artifactId>
>    <version>1.5.1</version>
>  </dependency>
>  <dependency>
>    <groupId>org.apache.spark</groupId>
>    <artifactId>spark-sql_2.10</artifactId>
>    <version>1.5.1</version>
>  </dependency>
> 
> 
>  [DEBUG] endProcessChildren: artifact=spark:scala:jar:1.0
>  [INFO]
> 
> 
>  [INFO] BUILD SUCCESS
>  [INFO]
> 
> 
>  [INFO] Total time: 01:04 min
>  [INFO] Finished at: 2016-03-15T22:55:08+00:00
>  [INFO] Final Memory: 32M/1089M
>  [INFO]
> 
> 
> 
>  Dr Mich Talebzadeh
> 
> 
> 
>  LinkedIn
> 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> 
> 
>  http://talebzadehmich.wordpress.com
> 
> 
> 
> 
>  On 15 March 2016 at 22:18, Jakob Odersky  wrote:
> >
> > Hi Mich,
> > probably unrelated to the current error you're seeing, however the
> > following dependencies will bite you later:
> > spark-hive_2.10
> > spark-csv_2.11
> > the problem here is that you're using libraries built for different
> > Scala binary versions (the numbers after the underscore). The simple
> > fix here is to specify the Scala binary version you're project builds
> > for (2.10 in your case, however note that version is EOL, you should
> > upgrade to scala 2.11.8 if possible).
> >
> > On a side note, sbt takes care of handling correct scala versions for
> > you (the double %% actually is a shorthand for appending
> > "_scalaBinaryVersion" to your dependency). It also enables you to
> > build and publish your project seamlessly against multiple versions.
> I
> > would strongly recommend to use it 

Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Jeff Zhang
>>> sqlContext.registerDataFrameAsTable(df, "table1")
>>> sqlContext.dropTempTable("table1")



On Wed, Mar 16, 2016 at 7:40 AM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Thanks
>
> Andy
>



-- 
Best Regards

Jeff Zhang


what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Andy Davidson
Thanks

Andy




Re: Error building spark app with Maven

2016-03-15 Thread Jakob Odersky
The artifactId in maven basically (in a simple case) corresponds to name in sbt.

Note however that you will manually need to append the
_scalaBinaryVersion to the artifactId in case you would like to build
against multiple scala versions (otherwise maven will overwrite the
generated jar with the latest one).


On Tue, Mar 15, 2016 at 4:27 PM, Mich Talebzadeh
 wrote:
> ok  Ted
>
> In sbt I have
>
> name := "ImportCSV"
> version := "1.0"
> scalaVersion := "2.10.4"
>
> which ends up in importcsv_2.10-1.0.jar as part of
> target/scala-2.10/importcsv_2.10-1.0.jar
>
> In mvn I have
>
> <version>1.0</version>
> <artifactId>scala</artifactId>
>
>
> Does it matter?
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On 15 March 2016 at 23:17, Ted Yu  wrote:
>>
>> <version>1.0</version>
>> ...
>> <artifactId>scala</artifactId>
>>
>> On Tue, Mar 15, 2016 at 4:14 PM, Mich Talebzadeh
>>  wrote:
>>>
>>> An observation
>>>
>>> Once compiled with MVN the job submit works as follows:
>>>
>>> + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages
>>> com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master
>>> spark://50.140.197.217:7077 --executor-memory=12G --executor-cores=12
>>> --num-executors=2 target/scala-1.0.jar
>>>
>>> With sbt it takes this form
>>>
>>> + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages
>>> com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master
>>> spark://50.140.197.217:7077 --executor-memory=12G --executor-cores=12
>>> --num-executors=2 target/scala-2.10/importcsv_2.10-1.0.jar
>>>
>>> They both return the same results. However, why mvnjar file name is
>>> different (may be a naive question!)?
>>>
>>> thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>>
>>> On 15 March 2016 at 22:43, Mich Talebzadeh 
>>> wrote:

 Many thanks Ted and thanks for heads up Jakob

 Just these two changes to dependencies

 
 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.10</artifactId>
   <version>1.5.1</version>
 </dependency>
 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-sql_2.10</artifactId>
   <version>1.5.1</version>
 </dependency>


 [DEBUG] endProcessChildren: artifact=spark:scala:jar:1.0
 [INFO]
 
 [INFO] BUILD SUCCESS
 [INFO]
 
 [INFO] Total time: 01:04 min
 [INFO] Finished at: 2016-03-15T22:55:08+00:00
 [INFO] Final Memory: 32M/1089M
 [INFO]
 

 Dr Mich Talebzadeh



 LinkedIn
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



 http://talebzadehmich.wordpress.com




 On 15 March 2016 at 22:18, Jakob Odersky  wrote:
>
> Hi Mich,
> probably unrelated to the current error you're seeing, however the
> following dependencies will bite you later:
> spark-hive_2.10
> spark-csv_2.11
> the problem here is that you're using libraries built for different
> Scala binary versions (the numbers after the underscore). The simple
> fix here is to specify the Scala binary version you're project builds
> for (2.10 in your case, however note that version is EOL, you should
> upgrade to scala 2.11.8 if possible).
>
> On a side note, sbt takes care of handling correct scala versions for
> you (the double %% actually is a shorthand for appending
> "_scalaBinaryVersion" to your dependency). It also enables you to
> build and publish your project seamlessly against multiple versions. I
> would strongly recommend to use it in Scala projects.
>
> cheers,
> --Jakob
>
>
>
> On Tue, Mar 15, 2016 at 3:08 PM, Mich Talebzadeh
>  wrote:
> > Hi,
> >
> > I normally use sbt and using this sbt file works fine for me
> >
> >  cat ImportCSV.sbt
> > name := "ImportCSV"
> > version := "1.0"
> > scalaVersion := "2.10.4"
> > libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
> > libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
> > libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
> > libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.3.0"
> >
> > This is my first trial using Mavan and pom
> >
> >
> > my pom.xml file looks like this but throws error at build
> >
> >
> > [DEBUG]   com.univocity:univocity-parsers:jar:1.5.1:compile
> > [INFO]
> >
> > 

Re: Error building spark app with Maven

2016-03-15 Thread Ted Yu
Feel free to adjust the artifactId and version in Maven.

They're under your control. 

> On Mar 15, 2016, at 4:27 PM, Mich Talebzadeh  
> wrote:
> 
> ok  Ted
> 
> In sbt I have
> 
> name := "ImportCSV"
> version := "1.0"
> scalaVersion := "2.10.4"
> 
> which ends up in importcsv_2.10-1.0.jar as part of 
> target/scala-2.10/importcsv_2.10-1.0.jar
> 
> In mvn I have
> 
> 1.0
> scala
> 
> 
> Does it matter?
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 15 March 2016 at 23:17, Ted Yu  wrote:
>> 1.0
>> ...
>> scala
>> 
>>> On Tue, Mar 15, 2016 at 4:14 PM, Mich Talebzadeh 
>>>  wrote:
>>> An observation
>>> 
>>> Once compiled with MVN the job submit works as follows:
>>> 
>>> + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages 
>>> com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master 
>>> spark://50.140.197.217:7077 --executor-memory=12G --executor-cores=12 
>>> --num-executors=2 target/scala-1.0.jar
>>> 
>>> With sbt it takes this form
>>> 
>>> + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages 
>>> com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master 
>>> spark://50.140.197.217:7077 --executor-memory=12G --executor-cores=12 
>>> --num-executors=2 target/scala-2.10/importcsv_2.10-1.0.jar
>>> 
>>> They both return the same results. However, why mvnjar file name is 
>>> different (may be a naive question!)?
>>> 
>>> thanks
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>> 
 On 15 March 2016 at 22:43, Mich Talebzadeh  
 wrote:
 Many thanks Ted and thanks for heads up Jakob
 
 Just these two changes to dependencies
 
 
 org.apache.spark
 spark-core_2.10
 1.5.1
 
 
 org.apache.spark
 spark-sql_2.10
 1.5.1
 
 
 
 [DEBUG] endProcessChildren: artifact=spark:scala:jar:1.0
 [INFO] 
 
 [INFO] BUILD SUCCESS
 [INFO] 
 
 [INFO] Total time: 01:04 min
 [INFO] Finished at: 2016-03-15T22:55:08+00:00
 [INFO] Final Memory: 32M/1089M
 [INFO] 
 
 
 Dr Mich Talebzadeh
  
 LinkedIn  
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
  
 http://talebzadehmich.wordpress.com
  
 
> On 15 March 2016 at 22:18, Jakob Odersky  wrote:
> Hi Mich,
> probably unrelated to the current error you're seeing, however the
> following dependencies will bite you later:
> spark-hive_2.10
> spark-csv_2.11
> the problem here is that you're using libraries built for different
> Scala binary versions (the numbers after the underscore). The simple
> fix here is to specify the Scala binary version you're project builds
> for (2.10 in your case, however note that version is EOL, you should
> upgrade to scala 2.11.8 if possible).
> 
> On a side note, sbt takes care of handling correct scala versions for
> you (the double %% actually is a shorthand for appending
> "_scalaBinaryVersion" to your dependency). It also enables you to
> build and publish your project seamlessly against multiple versions. I
> would strongly recommend to use it in Scala projects.
> 
> cheers,
> --Jakob
> 
> 
> 
> On Tue, Mar 15, 2016 at 3:08 PM, Mich Talebzadeh
>  wrote:
> > Hi,
> >
> > I normally use sbt and using this sbt file works fine for me
> >
> >  cat ImportCSV.sbt
> > name := "ImportCSV"
> > version := "1.0"
> > scalaVersion := "2.10.4"
> > libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
> > libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
> > libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
> > libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.3.0"
> >
> > This is my first trial using Mavan and pom
> >
> >
> > my pom.xml file looks like this but throws error at build
> >
> >
> > [DEBUG]   com.univocity:univocity-parsers:jar:1.5.1:compile
> > [INFO]
> > 
> > [INFO] BUILD FAILURE
> > [INFO]
> > 
> > [INFO] Total time: 1.326 s
> > [INFO] Finished at: 2016-03-15T22:17:29+00:00
> > [INFO] Final 

How to add an accumulator for a Set in Spark

2016-03-15 Thread SRK
Hi,

How do I add an accumulator for a Set in Spark?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-an-accumulator-for-a-Set-in-Spark-tp26510.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
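
One way to do this in Spark 1.x is the Accumulable API with a custom AccumulableParam. A minimal sketch (object and variable names are illustrative, not from this thread), accumulating into an immutable Set[String]:

import org.apache.spark.AccumulableParam

// Merges String elements into an immutable Set[String]
object StringSetParam extends AccumulableParam[Set[String], String] {
  // add a single element contributed on an executor
  def addAccumulator(acc: Set[String], elem: String): Set[String] = acc + elem
  // merge per-partition sets on the driver
  def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2
  // the empty value each task starts from
  def zero(initial: Set[String]): Set[String] = Set.empty[String]
}

val seen = sc.accumulable(Set.empty[String])(StringSetParam)
sc.parallelize(Seq("a", "b", "a")).foreach(x => seen += x)
println(seen.value)   // Set(a, b) -- read the value only on the driver
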



what is the best practice to read configure file in spark streaming

2016-03-15 Thread yaoxiaohua
Hi guys,

I'm using Kafka + Spark Streaming to do log analysis.

Now my requirement is that the log alarm rules may change
sometimes.

Rules look like this:

App=Hadoop,keywords=oom|Exception|error,threshold=10

The threshold or keywords may be updated from time to time.

What I do is:

1.   Use a Map[app,logrule] variable to store the log rules, defined
as a static member.

2.   Use a custom StreamingListener and read the configuration file in
the onBatchStarted event.

3.   When I use the variable, I found the value is not updated in the
window stream, so now I re-read the configuration file every time it is used.

4.   Now I keep the rule file on a local path, so I have to copy it to every
worker.

What is the best practice in this case?

Thanks for your help. I'm new to Spark Streaming and don't yet fully
understand its principles.

 

Best Regards,

Evan Yao
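
For what it's worth, one common pattern for this (a sketch under assumptions, not from this thread): reload the rule file on the driver inside transform(), so only the driver machine needs the file and every batch picks up the latest rules. Here logStream is assumed to be a DStream[(app, message)], and LogRule, loadRules, and the path are illustrative names.

import scala.io.Source

case class LogRule(app: String, keywords: Seq[String], threshold: Int)

// Parses lines like "App=Hadoop,keywords=oom|Exception|error,threshold=10"
def loadRules(path: String): Map[String, LogRule] = {
  val src = Source.fromFile(path)
  try {
    src.getLines().filter(_.nonEmpty).map { line =>
      val kv = line.split(",").map(_.split("=", 2)).map(a => a(0) -> a(1)).toMap
      kv("App") -> LogRule(kv("App"), kv("keywords").split("\\|"), kv("threshold").toInt)
    }.toMap
  } finally src.close()
}

val alarms = logStream.transform { rdd =>
  // this closure runs on the driver at every batch interval,
  // so the freshest rules are shipped to the executors with each batch
  val rules = loadRules("/etc/loganalysis/rules.conf")
  rdd.filter { case (app, msg) =>
    rules.get(app).exists(r => r.keywords.exists(k => msg.contains(k)))
  }
}

Because the file is only read on the driver, it does not have to be copied to every worker.
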



Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
ok  Ted

In sbt I have

name := "ImportCSV"
version := "1.0"
scalaVersion := "2.10.4"

which ends up in importcsv_2.10-1.0.jar as part of
target/scala-2.10/importcsv_2.10-1.0.jar

In mvn I have

1.0
scala


Does it matter?


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 March 2016 at 23:17, Ted Yu  wrote:

> 1.0
> ...
> scala
>
> On Tue, Mar 15, 2016 at 4:14 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> An observation
>>
>> Once compiled with MVN the job submit works as follows:
>>
>> + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages
>> com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master spark://
>> 50.140.197.217:7077 --executor-memory=12G --executor-cores=12
>> --num-executors=2 *target/scala-1.0.jar*
>>
>> With sbt it takes this form
>>
>> + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages
>> com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master spark://
>> 50.140.197.217:7077 --executor-memory=12G --executor-cores=12
>> --num-executors=2
>> *target/scala-2.10/importcsv_2.10-1.0.jar*
>>
>> They both return the same results. However, why mvnjar file name is
>> different (may be a naive question!)?
>>
>> thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 15 March 2016 at 22:43, Mich Talebzadeh 
>> wrote:
>>
>>> Many thanks Ted and thanks for heads up Jakob
>>>
>>> Just these two changes to dependencies
>>>
>>> 
>>> org.apache.spark
>>> spark-core*_2.10*
>>> 1.5.1
>>> 
>>> 
>>> org.apache.spark
>>> spark-sql*_2.10*
>>> 1.5.1
>>> 
>>>
>>>
>>> [DEBUG] endProcessChildren: artifact=spark:scala:jar:1.0
>>> [INFO]
>>> 
>>> [INFO] BUILD SUCCESS
>>> [INFO]
>>> 
>>> [INFO] Total time: 01:04 min
>>> [INFO] Finished at: 2016-03-15T22:55:08+00:00
>>> [INFO] Final Memory: 32M/1089M
>>> [INFO]
>>> 
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 15 March 2016 at 22:18, Jakob Odersky  wrote:
>>>
 Hi Mich,
 probably unrelated to the current error you're seeing, however the
 following dependencies will bite you later:
 spark-hive_2.10
 spark-csv_2.11
 the problem here is that you're using libraries built for different
 Scala binary versions (the numbers after the underscore). The simple
 fix here is to specify the Scala binary version you're project builds
 for (2.10 in your case, however note that version is EOL, you should
 upgrade to scala 2.11.8 if possible).

 On a side note, sbt takes care of handling correct scala versions for
 you (the double %% actually is a shorthand for appending
 "_scalaBinaryVersion" to your dependency). It also enables you to
 build and publish your project seamlessly against multiple versions. I
 would strongly recommend to use it in Scala projects.

 cheers,
 --Jakob



 On Tue, Mar 15, 2016 at 3:08 PM, Mich Talebzadeh
  wrote:
 > Hi,
 >
 > I normally use sbt and using this sbt file works fine for me
 >
 >  cat ImportCSV.sbt
 > name := "ImportCSV"
 > version := "1.0"
 > scalaVersion := "2.10.4"
 > libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
 > libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
 > libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
 > libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.3.0"
 >
 > This is my first trial using Mavan and pom
 >
 >
 > my pom.xml file looks like this but throws error at build
 >
 >
 > [DEBUG]   com.univocity:univocity-parsers:jar:1.5.1:compile
 > [INFO]
 >
 
 > [INFO] BUILD FAILURE
 > [INFO]
 >
 
 > [INFO] Total time: 1.326 s
 > [INFO] Finished at: 2016-03-15T22:17:29+00:00
 > [INFO] Final Memory: 14M/455M
 > [INFO]
 >
 

Re: Error building spark app with Maven

2016-03-15 Thread Ted Yu
1.0
...
scala

On Tue, Mar 15, 2016 at 4:14 PM, Mich Talebzadeh 
wrote:

> An observation
>
> Once compiled with MVN the job submit works as follows:
>
> + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages
> com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master spark://
> 50.140.197.217:7077 --executor-memory=12G --executor-cores=12
> --num-executors=2 *target/scala-1.0.jar*
>
> With sbt it takes this form
>
> + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages
> com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master spark://
> 50.140.197.217:7077 --executor-memory=12G --executor-cores=12
> --num-executors=2
> *target/scala-2.10/importcsv_2.10-1.0.jar*
>
> They both return the same results. However, why mvnjar file name is
> different (may be a naive question!)?
>
> thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 March 2016 at 22:43, Mich Talebzadeh 
> wrote:
>
>> Many thanks Ted and thanks for heads up Jakob
>>
>> Just these two changes to dependencies
>>
>> 
>> org.apache.spark
>> spark-core*_2.10*
>> 1.5.1
>> 
>> 
>> org.apache.spark
>> spark-sql*_2.10*
>> 1.5.1
>> 
>>
>>
>> [DEBUG] endProcessChildren: artifact=spark:scala:jar:1.0
>> [INFO]
>> 
>> [INFO] BUILD SUCCESS
>> [INFO]
>> 
>> [INFO] Total time: 01:04 min
>> [INFO] Finished at: 2016-03-15T22:55:08+00:00
>> [INFO] Final Memory: 32M/1089M
>> [INFO]
>> 
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 15 March 2016 at 22:18, Jakob Odersky  wrote:
>>
>>> Hi Mich,
>>> probably unrelated to the current error you're seeing, however the
>>> following dependencies will bite you later:
>>> spark-hive_2.10
>>> spark-csv_2.11
>>> the problem here is that you're using libraries built for different
>>> Scala binary versions (the numbers after the underscore). The simple
>>> fix here is to specify the Scala binary version you're project builds
>>> for (2.10 in your case, however note that version is EOL, you should
>>> upgrade to scala 2.11.8 if possible).
>>>
>>> On a side note, sbt takes care of handling correct scala versions for
>>> you (the double %% actually is a shorthand for appending
>>> "_scalaBinaryVersion" to your dependency). It also enables you to
>>> build and publish your project seamlessly against multiple versions. I
>>> would strongly recommend to use it in Scala projects.
>>>
>>> cheers,
>>> --Jakob
>>>
>>>
>>>
>>> On Tue, Mar 15, 2016 at 3:08 PM, Mich Talebzadeh
>>>  wrote:
>>> > Hi,
>>> >
>>> > I normally use sbt and using this sbt file works fine for me
>>> >
>>> >  cat ImportCSV.sbt
>>> > name := "ImportCSV"
>>> > version := "1.0"
>>> > scalaVersion := "2.10.4"
>>> > libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
>>> > libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
>>> > libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
>>> > libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.3.0"
>>> >
>>> > This is my first trial using Mavan and pom
>>> >
>>> >
>>> > my pom.xml file looks like this but throws error at build
>>> >
>>> >
>>> > [DEBUG]   com.univocity:univocity-parsers:jar:1.5.1:compile
>>> > [INFO]
>>> >
>>> 
>>> > [INFO] BUILD FAILURE
>>> > [INFO]
>>> >
>>> 
>>> > [INFO] Total time: 1.326 s
>>> > [INFO] Finished at: 2016-03-15T22:17:29+00:00
>>> > [INFO] Final Memory: 14M/455M
>>> > [INFO]
>>> >
>>> 
>>> > [ERROR] Failed to execute goal on project scala: Could not resolve
>>> > dependencies for project spark:scala:jar:1.0: The following artifacts
>>> could
>>> > not be resolved: org.apache.spark:spark-core:jar:1.5.1,
>>> > org.apache.spark:spark-sql:jar:1.5.1: Failure to find
>>> > org.apache.spark:spark-core:jar:1.5.1 in
>>> > https://repo.maven.apache.org/maven2 was cached in the local
>>> repository,
>>> > resolution will not be reattempted until the update interval of
>>> central has
>>> > elapsed or updates are forced -> [Help 1]
>>> >
>>> >
>>> > My pom file is
>>> >
>>> >
>>> >  cat pom.xml
>>> > http://maven.apache.org/POM/4.0.0;

Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
An observation

Once compiled with MVN the job submit works as follows:

+ /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages
com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master spark://
50.140.197.217:7077 --executor-memory=12G --executor-cores=12
--num-executors=2 target/scala-1.0.jar

With sbt it takes this form

+ /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages
com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master spark://
50.140.197.217:7077 --executor-memory=12G --executor-cores=12
--num-executors=2
target/scala-2.10/importcsv_2.10-1.0.jar

They both return the same results. However, why is the mvn jar file name
different (maybe a naive question!)?

thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 March 2016 at 22:43, Mich Talebzadeh 
wrote:

> Many thanks Ted and thanks for heads up Jakob
>
> Just these two changes to dependencies
>
> 
> org.apache.spark
> spark-core*_2.10*
> 1.5.1
> 
> 
> org.apache.spark
> spark-sql*_2.10*
> 1.5.1
> 
>
>
> [DEBUG] endProcessChildren: artifact=spark:scala:jar:1.0
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 01:04 min
> [INFO] Finished at: 2016-03-15T22:55:08+00:00
> [INFO] Final Memory: 32M/1089M
> [INFO]
> 
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 March 2016 at 22:18, Jakob Odersky  wrote:
>
>> Hi Mich,
>> probably unrelated to the current error you're seeing, however the
>> following dependencies will bite you later:
>> spark-hive_2.10
>> spark-csv_2.11
>> the problem here is that you're using libraries built for different
>> Scala binary versions (the numbers after the underscore). The simple
>> fix here is to specify the Scala binary version you're project builds
>> for (2.10 in your case, however note that version is EOL, you should
>> upgrade to scala 2.11.8 if possible).
>>
>> On a side note, sbt takes care of handling correct scala versions for
>> you (the double %% actually is a shorthand for appending
>> "_scalaBinaryVersion" to your dependency). It also enables you to
>> build and publish your project seamlessly against multiple versions. I
>> would strongly recommend to use it in Scala projects.
>>
>> cheers,
>> --Jakob
>>
>>
>>
>> On Tue, Mar 15, 2016 at 3:08 PM, Mich Talebzadeh
>>  wrote:
>> > Hi,
>> >
>> > I normally use sbt and using this sbt file works fine for me
>> >
>> >  cat ImportCSV.sbt
>> > name := "ImportCSV"
>> > version := "1.0"
>> > scalaVersion := "2.10.4"
>> > libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
>> > libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
>> > libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
>> > libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.3.0"
>> >
>> > This is my first trial using Mavan and pom
>> >
>> >
>> > my pom.xml file looks like this but throws error at build
>> >
>> >
>> > [DEBUG]   com.univocity:univocity-parsers:jar:1.5.1:compile
>> > [INFO]
>> > 
>> > [INFO] BUILD FAILURE
>> > [INFO]
>> > 
>> > [INFO] Total time: 1.326 s
>> > [INFO] Finished at: 2016-03-15T22:17:29+00:00
>> > [INFO] Final Memory: 14M/455M
>> > [INFO]
>> > 
>> > [ERROR] Failed to execute goal on project scala: Could not resolve
>> > dependencies for project spark:scala:jar:1.0: The following artifacts
>> could
>> > not be resolved: org.apache.spark:spark-core:jar:1.5.1,
>> > org.apache.spark:spark-sql:jar:1.5.1: Failure to find
>> > org.apache.spark:spark-core:jar:1.5.1 in
>> > https://repo.maven.apache.org/maven2 was cached in the local
>> repository,
>> > resolution will not be reattempted until the update interval of central
>> has
>> > elapsed or updates are forced -> [Help 1]
>> >
>> >
>> > My pom file is
>> >
>> >
>> >  cat pom.xml
>> > http://maven.apache.org/POM/4.0.0;
>> > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
>> > xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>> > http://maven.apache.org/maven-v4_0_0.xsd;>
>> > 4.0.0
>> > spark
>> > 1.0
>> > ${project.artifactId}
>> >
>> > 
>> > 1.7
>> > 1.7
>> > UTF-8
>> > 2.10.4
>> > 2.15.2
>> > 

Re: Get output of the ALS algorithm.

2016-03-15 Thread Bryan Cutler
Jacek is correct for using org.apache.spark.ml.recommendation.ALSModel

If you are trying to save
org.apache.spark.mllib.recommendation.MatrixFactorizationModel, then it is
similar, but just a little different, see the example here
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/RecommendationExample.scala#L62
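
A brief sketch along the lines of that example (paths are placeholders; model is assumed to be a trained MatrixFactorizationModel):

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// persist the whole model and load it back later
model.save(sc, "hdfs:///models/myALSModel")
val reloaded = MatrixFactorizationModel.load(sc, "hdfs:///models/myALSModel")

// or dump the learned user factors as plain text
model.userFeatures
  .map { case (id, factors) => s"$id,${factors.mkString(",")}" }
  .saveAsTextFile("hdfs:///models/userFeatures")
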

On Fri, Mar 11, 2016 at 8:18 PM, Shishir Anshuman  wrote:

> The model produced after training.
>
> On Fri, Mar 11, 2016 at 10:29 PM, Bryan Cutler  wrote:
>
>> Are you trying to save predictions on a dataset to a file, or the model
>> produced after training with ALS?
>>
>> On Thu, Mar 10, 2016 at 7:57 PM, Shishir Anshuman <
>> shishiranshu...@gmail.com> wrote:
>>
>>> hello,
>>>
>>> I am new to Apache Spark and would like to get the Recommendation output
>>> of the ALS algorithm in a file.
>>> Please suggest me the solution.
>>>
>>> Thank you
>>>
>>>
>>>
>>
>


spark.ml : eval model outside sparkContext

2016-03-15 Thread Emmanuel
Hello,
In MLlib with Spark 1.4, I was able to evaluate a model by loading it and using
`predict` on a vector of features. I would train on Spark but use my model in
my own workflow.

In `spark.ml` it seems like the only way to evaluate is to use `transform`, which
only takes a DataFrame. To build a DataFrame I need a SparkContext or
SQLContext, so it doesn't seem to be possible to evaluate outside of Spark.

Is there either a way to build a DataFrame without a SparkContext, or to predict
with a vector or list of features without a DataFrame?
Thanks

Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
Many thanks Ted and thanks for heads up Jakob

Just these two changes to dependencies


org.apache.spark
spark-core_2.10
1.5.1


org.apache.spark
spark-sql_2.10
1.5.1



[DEBUG] endProcessChildren: artifact=spark:scala:jar:1.0
[INFO]

[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 01:04 min
[INFO] Finished at: 2016-03-15T22:55:08+00:00
[INFO] Final Memory: 32M/1089M
[INFO]


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 March 2016 at 22:18, Jakob Odersky  wrote:

> Hi Mich,
> probably unrelated to the current error you're seeing, however the
> following dependencies will bite you later:
> spark-hive_2.10
> spark-csv_2.11
> the problem here is that you're using libraries built for different
> Scala binary versions (the numbers after the underscore). The simple
> fix here is to specify the Scala binary version you're project builds
> for (2.10 in your case, however note that version is EOL, you should
> upgrade to scala 2.11.8 if possible).
>
> On a side note, sbt takes care of handling correct scala versions for
> you (the double %% actually is a shorthand for appending
> "_scalaBinaryVersion" to your dependency). It also enables you to
> build and publish your project seamlessly against multiple versions. I
> would strongly recommend to use it in Scala projects.
>
> cheers,
> --Jakob
>
>
>
> On Tue, Mar 15, 2016 at 3:08 PM, Mich Talebzadeh
>  wrote:
> > Hi,
> >
> > I normally use sbt and using this sbt file works fine for me
> >
> >  cat ImportCSV.sbt
> > name := "ImportCSV"
> > version := "1.0"
> > scalaVersion := "2.10.4"
> > libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
> > libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
> > libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
> > libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.3.0"
> >
> > This is my first trial using Mavan and pom
> >
> >
> > my pom.xml file looks like this but throws error at build
> >
> >
> > [DEBUG]   com.univocity:univocity-parsers:jar:1.5.1:compile
> > [INFO]
> > 
> > [INFO] BUILD FAILURE
> > [INFO]
> > 
> > [INFO] Total time: 1.326 s
> > [INFO] Finished at: 2016-03-15T22:17:29+00:00
> > [INFO] Final Memory: 14M/455M
> > [INFO]
> > 
> > [ERROR] Failed to execute goal on project scala: Could not resolve
> > dependencies for project spark:scala:jar:1.0: The following artifacts
> could
> > not be resolved: org.apache.spark:spark-core:jar:1.5.1,
> > org.apache.spark:spark-sql:jar:1.5.1: Failure to find
> > org.apache.spark:spark-core:jar:1.5.1 in
> > https://repo.maven.apache.org/maven2 was cached in the local repository,
> > resolution will not be reattempted until the update interval of central
> has
> > elapsed or updates are forced -> [Help 1]
> >
> >
> > My pom file is
> >
> >
> >  cat pom.xml
> > http://maven.apache.org/POM/4.0.0;
> > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
> > xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> > http://maven.apache.org/maven-v4_0_0.xsd;>
> > 4.0.0
> > spark
> > 1.0
> > ${project.artifactId}
> >
> > 
> > 1.7
> > 1.7
> > UTF-8
> > 2.10.4
> > 2.15.2
> > 
> >
> > 
> >   
> > org.scala-lang
> > scala-library
> > 2.10.2
> >   
> > 
> > org.apache.spark
> > spark-core
> > 1.5.1
> > 
> > 
> > org.apache.spark
> > spark-sql
> > 1.5.1
> > 
> > 
> > org.apache.spark
> > spark-hive_2.10
> > 1.5.0
> > 
> > 
> > com.databricks
> > spark-csv_2.11
> > 1.3.0
> > 
> > 
> >
> > 
> > src/main/scala
> > 
> > 
> > org.scala-tools
> > maven-scala-plugin
> > ${maven-scala-plugin.version}
> > 
> > 
> > 
> > compile
> > 
> > 
> > 
> > 
> > 
> > -Xms64m
> > -Xmx1024m
> > 
> > 
> > 
> > 
> > org.apache.maven.plugins
> > maven-shade-plugin
> > 1.6
> > 
> > 
> > package
> > 
> > shade
> > 
> > 
> > 
> > 
> > *:*
> > 
> > META-INF/*.SF
> > META-INF/*.DSA
> > META-INF/*.RSA
> > 
> > 
> > 
> > 
> >  >
> implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
> > com.group.id.Launcher1
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> >
> > scala
> > 
> >
> >
> > I am sure I have omitted something?
> >
> >
> > Thanks
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> >
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
>


Re: Error building spark app with Maven

2016-03-15 Thread Jakob Odersky
Hi Mich,
probably unrelated to the current error you're seeing, however the
following dependencies will bite you later:
spark-hive_2.10
spark-csv_2.11
the problem here is that you're using libraries built for different
Scala binary versions (the numbers after the underscore). The simple
fix here is to specify the Scala binary version you're project builds
for (2.10 in your case, however note that version is EOL, you should
upgrade to scala 2.11.8 if possible).

On a side note, sbt takes care of handling correct scala versions for
you (the double %% actually is a shorthand for appending
"_scalaBinaryVersion" to your dependency). It also enables you to
build and publish your project seamlessly against multiple versions. I
would strongly recommend to use it in Scala projects.

cheers,
--Jakob
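
To illustrate that point, a small sbt sketch (not from the thread) of what %% expands to and how cross-building works:

scalaVersion := "2.10.6"
crossScalaVersions := Seq("2.10.6", "2.11.8")

// with %%, sbt appends the Scala binary version for you ...
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
// ... which resolves to the same artifact as spelling it out by hand:
// "org.apache.spark" % "spark-core_2.10" % "1.5.1"

// running "+package" then builds the project once per entry in crossScalaVersions
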



On Tue, Mar 15, 2016 at 3:08 PM, Mich Talebzadeh
 wrote:
> Hi,
>
> I normally use sbt and using this sbt file works fine for me
>
>  cat ImportCSV.sbt
> name := "ImportCSV"
> version := "1.0"
> scalaVersion := "2.10.4"
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
> libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.3.0"
>
> This is my first trial using Mavan and pom
>
>
> my pom.xml file looks like this but throws error at build
>
>
> [DEBUG]   com.univocity:univocity-parsers:jar:1.5.1:compile
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 1.326 s
> [INFO] Finished at: 2016-03-15T22:17:29+00:00
> [INFO] Final Memory: 14M/455M
> [INFO]
> 
> [ERROR] Failed to execute goal on project scala: Could not resolve
> dependencies for project spark:scala:jar:1.0: The following artifacts could
> not be resolved: org.apache.spark:spark-core:jar:1.5.1,
> org.apache.spark:spark-sql:jar:1.5.1: Failure to find
> org.apache.spark:spark-core:jar:1.5.1 in
> https://repo.maven.apache.org/maven2 was cached in the local repository,
> resolution will not be reattempted until the update interval of central has
> elapsed or updates are forced -> [Help 1]
>
>
> My pom file is
>
>
>  cat pom.xml
> http://maven.apache.org/POM/4.0.0;
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> http://maven.apache.org/maven-v4_0_0.xsd;>
> 4.0.0
> spark
> 1.0
> ${project.artifactId}
>
> 
> 1.7
> 1.7
> UTF-8
> 2.10.4
> 2.15.2
> 
>
> 
>   
> org.scala-lang
> scala-library
> 2.10.2
>   
> 
> org.apache.spark
> spark-core
> 1.5.1
> 
> 
> org.apache.spark
> spark-sql
> 1.5.1
> 
> 
> org.apache.spark
> spark-hive_2.10
> 1.5.0
> 
> 
> com.databricks
> spark-csv_2.11
> 1.3.0
> 
> 
>
> 
> src/main/scala
> 
> 
> org.scala-tools
> maven-scala-plugin
> ${maven-scala-plugin.version}
> 
> 
> 
> compile
> 
> 
> 
> 
> 
> -Xms64m
> -Xmx1024m
> 
> 
> 
> 
> org.apache.maven.plugins
> maven-shade-plugin
> 1.6
> 
> 
> package
> 
> shade
> 
> 
> 
> 
> *:*
> 
> META-INF/*.SF
> META-INF/*.DSA
> META-INF/*.RSA
> 
> 
> 
> 
>  implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
> com.group.id.Launcher1
> 
> 
> 
> 
> 
> 
> 
> 
>
> scala
> 
>
>
> I am sure I have omitted something?
>
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
Hi,

I normally use sbt and using this sbt file works fine for me

 cat ImportCSV.sbt
name := "ImportCSV"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.3.0"

This is my first trial using Maven and a pom.


My pom.xml file looks like this but throws an error at build:


[DEBUG]   com.univocity:univocity-parsers:jar:1.5.1:compile
[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 1.326 s
[INFO] Finished at: 2016-03-15T22:17:29+00:00
[INFO] Final Memory: 14M/455M
[INFO]

[ERROR] Failed to execute goal on project scala: Could not resolve
dependencies for project spark:scala:jar:1.0: The following artifacts could
not be resolved: org.apache.spark:spark-core:jar:1.5.1,
org.apache.spark:spark-sql:jar:1.5.1: Failure to find
org.apache.spark:spark-core:jar:1.5.1 in
https://repo.maven.apache.org/maven2 was cached in the local repository,
resolution will not be reattempted until the update interval of central has
elapsed or updates are forced -> [Help 1]


My pom file is


 cat pom.xml
http://maven.apache.org/POM/4.0.0; xmlns:xsi="
http://www.w3.org/2001/XMLSchema-instance;
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd;>
4.0.0
spark
1.0
${project.artifactId}


1.7
1.7
UTF-8
2.10.4
2.15.2



  
org.scala-lang
scala-library
2.10.2
  

org.apache.spark
spark-core
1.5.1


org.apache.spark
spark-sql
1.5.1


org.apache.spark
spark-hive_2.10
1.5.0


com.databricks
spark-csv_2.11
1.3.0




src/main/scala


org.scala-tools
maven-scala-plugin
${maven-scala-plugin.version}



compile





-Xms64m
-Xmx1024m




org.apache.maven.plugins
maven-shade-plugin
1.6


package

shade




*:*

META-INF/*.SF
META-INF/*.DSA
META-INF/*.RSA





com.group.id.Launcher1









scala



I am sure I have omitted something?


Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


Re: Microsoft SQL dialect issues

2016-03-15 Thread Suresh Thalamati
You should be able to register your own dialect if the default mappings are  
not working for your scenario.

import org.apache.spark.sql.jdbc
JdbcDialects.registerDialect(MyDialect)

Please refer to the  JdbcDialects to find example of  existing default dialect 
for your database or another database.
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
 

https://github.com/apache/spark/tree/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/jdbc
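
As an illustration, a minimal dialect along those lines (a sketch against the Spark 1.6 developer API; the two type mappings shown are examples for SQL Server, not a complete dialect):

import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

object SqlServerDialect extends JdbcDialect {
  // claim jdbc:sqlserver:// URLs
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  // override the write-side type mappings that SQL Server rejects by default
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("NVARCHAR(MAX)", Types.NVARCHAR))
    case BooleanType => Some(JdbcType("BIT", Types.BIT))
    case _           => None   // fall back to the built-in defaults
  }
}

JdbcDialects.registerDialect(SqlServerDialect)

The registration has to run before the DataFrame write that needs it.
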
 


 


> On Mar 15, 2016, at 12:41 PM, Mich Talebzadeh  
> wrote:
> 
> Hi,
> 
> Can you please clarify what you are trying to achieve and I guess you mean 
> Transact_SQL for MSSQL?
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 15 March 2016 at 19:09, Andrés Ivaldi  > wrote:
> Hello, I'm trying to use MSSQL, storing data on MSSQL but i'm having dialect 
> problems
> I found this
> https://mail-archives.apache.org/mod_mbox/spark-issues/201510.mbox/%3cjira.12901078.1443461051000.34556.1444123886...@atlassian.jira%3E
>  
> 
> 
> That is what is happening to me, It's possible to define the dialect? so I 
> can override the default for SQLServer?
> 
> Regards. 
> 
> -- 
> Ing. Ivaldi Andres
> 



Re: How to convert Parquet file to a text file.

2016-03-15 Thread Kevin Mellott
I'd recommend reading the parquet file into a DataFrame object, and then
using spark-csv  to write to a CSV
file.

On Tue, Mar 15, 2016 at 3:34 PM, Shishir Anshuman  wrote:

> I need to convert the parquet file generated by the spark to a text (csv
> preferably) file. I want to use the data model outside spark.
>
> Any suggestion on how to proceed?
>


Re: newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
Hi Frank

We have thousands of small files. Each file is between 6K and maybe 100K.

Conductor looks interesting

Andy

From:  Frank Austin Nothaft 
Date:  Tuesday, March 15, 2016 at 11:59 AM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: newbie HDFS S3 best practices

> Hard to say with #1 without knowing your application¹s characteristics; for
> #2, we use conductor   with IAM
> roles, .boto/.aws/credentials files.
> 
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
> 
>> On Mar 15, 2016, at 11:45 AM, Andy Davidson 
>> wrote:
>> 
>> We use the spark-ec2 script to create AWS clusters as needed (we do not use
>> AWS EMR)
>> 1. will we get better performance if we copy data to HDFS before we run
>> instead of reading directly from S3?
>>  2. What is a good way to move results from HDFS to S3?
>> 
>> 
>> It seems like there are many ways to bulk copy to s3. Many of them require we
>> explicitly use the AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@
>>  . This seems like a
>> bad idea? 
>> 
>> What would you recommend?
>> 
>> Thanks
>> 
>> Andy
>> 
>> 
> 




Re: Spark and KafkaUtils

2016-03-15 Thread Vinti Maheshwari
Hi Cody,

I wanted to share my updated build.sbt, which works with Kafka without giving
any errors; it may help other users if they face a similar issue.

name := "NetworkStreaming"

version := "1.0"

scalaVersion:= "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0", // kafka
  "org.apache.spark" %% "spark-mllib" % "1.6.0",
  "org.codehaus.groovy" % "groovy-all" % "1.8.6",
  "org.apache.hbase" % "hbase-server" % "1.1.2",
  "org.apache.spark" %% "spark-sql"  % "1.6.0",
  "org.apache.hbase" % "hbase-common" % "1.1.2"
excludeAll(ExclusionRule(organization = "javax.servlet",
name="javax.servlet-api"), ExclusionRule(organization =
"org.mortbay.jetty", name="jetty"), ExclusionRule(organization =
"org.mortbay.jetty", name="servlet-api-2.5")),
  "org.apache.hbase" % "hbase-client" % "1.1.2"
excludeAll(ExclusionRule(organization = "javax.servlet",
name="javax.servlet-api"), ExclusionRule(organization =
"org.mortbay.jetty", name="jetty"), ExclusionRule(organization =
"org.mortbay.jetty", name="servlet-api-2.5"))
)


assemblyMergeStrategy in assembly := {
  case m if m.toLowerCase.endsWith("manifest.mf")  =>
MergeStrategy.discard
  case m if m.toLowerCase.matches("meta-inf.*\\.sf$")  =>
MergeStrategy.discard
  case "log4j.properties"  =>
MergeStrategy.discard
  case m if m.toLowerCase.startsWith("meta-inf/services/") =>
MergeStrategy.filterDistinctLines
  case "reference.conf"=>
MergeStrategy.concat
  case _   =>
MergeStrategy.first
}

Thanks & Regards,

Vinti



On Wed, Feb 24, 2016 at 1:34 PM, Cody Koeninger  wrote:

> Looks like conflicting versions of the same dependency.
> If you look at the mergeStrategy section of the build file I posted, you
> can add additional lines for whatever dependencies are causing issues, e.g.
>
>   case PathList("org", "jboss", "netty", _*) => MergeStrategy.first
>
> On Wed, Feb 24, 2016 at 2:55 PM, Vinti Maheshwari 
> wrote:
>
>> Thanks much Cody, I added assembly.sbt and modified build.sbt with ivy
>> bug related content.
>>
>> It's giving lots of errors related to ivy:
>>
>> *[error]
>> /Users/vintim/.ivy2/cache/javax.activation/activation/jars/activation-1.1.jar:javax/activation/ActivationDataFlavor.class*
>>
>> Here is complete error log:
>> https://gist.github.com/Vibhuti/07c24d2893fa6e520d4c
>>
>>
>> Regards,
>> ~Vinti
>>
>> On Wed, Feb 24, 2016 at 12:16 PM, Cody Koeninger 
>> wrote:
>>
>>> Ok, that build file I linked earlier has a minimal example of use.  just
>>> running 'sbt assembly' given a similar build file should build a jar with
>>> all the dependencies.
>>>
>>> On Wed, Feb 24, 2016 at 1:50 PM, Vinti Maheshwari 
>>> wrote:
>>>
 I am not using sbt assembly currently. I need to check how to use sbt
 assembly.

 Regards,
 ~Vinti

 On Wed, Feb 24, 2016 at 11:10 AM, Cody Koeninger 
 wrote:

> Are you using sbt assembly?  That's what will include all of the
> non-provided dependencies in a single jar along with your code.  Otherwise
> you'd have to specify each separate jar in your spark-submit line, which 
> is
> a pain.
>
> On Wed, Feb 24, 2016 at 12:49 PM, Vinti Maheshwari <
> vinti.u...@gmail.com> wrote:
>
>> Hi Cody,
>>
>> I tried with the build file you provided, but it's not working for
>> me, getting same error:
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/streaming/kafka/KafkaUtils$
>>
>> I am not getting this error while building  (sbt package). I am
>> getting this error when i am running my spark-streaming program.
>> Do i need to specify kafka jar path manually with spark-submit --jars
>> flag?
>>
>> My build.sbt:
>>
>> name := "NetworkStreaming"
>> libraryDependencies += "org.apache.hbase" % "hbase" % "0.92.1"
>>
>> libraryDependencies += "org.apache.hadoop" % "hadoop-core" % "1.0.2"
>>
>> libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.0.0"
>>
>> libraryDependencies ++= Seq(
>>   "org.apache.spark" % "spark-streaming_2.10" % "1.5.2",
>>   "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.5.2"
>> )
>>
>>
>>
>> Regards,
>> ~Vinti
>>
>> On Wed, Feb 24, 2016 at 9:33 AM, Cody Koeninger 
>> wrote:
>>
>>> spark streaming is provided, kafka is not.
>>>
>>> This build file
>>>
>>> https://github.com/koeninger/kafka-exactly-once/blob/master/build.sbt
>>>
>>> includes some hacks for ivy issues that may no longer be strictly
>>> necessary, but try that build and see if it works for you.
>>>
>>>
>>> On Wed, Feb 24, 2016 at 11:14 AM, Vinti Maheshwari <
>>> 

How to convert Parquet file to a text file.

2016-03-15 Thread Shishir Anshuman
I need to convert the Parquet file generated by Spark to a text (CSV
preferably) file. I want to use the data model outside Spark.

Any suggestion on how to proceed?


Re: Docker configuration for akka spark streaming

2016-03-15 Thread David Gomez Saavedra
The issue is related to this
https://issues.apache.org/jira/browse/SPARK-13906

.set("spark.rpc.netty.dispatcher.numThreads","2")

seems to fix the problem

On Tue, Mar 15, 2016 at 6:45 AM, David Gomez Saavedra 
wrote:

> I have updated the config since I realized the actor system was listening
> on driver port + 1. So changed the ports in my program + the docker images
>
> val conf = new SparkConf()
>   .setMaster(sparkMaster)
>   //.setMaster("local[2]")
>   .setAppName(sparkApp)
>   .set("spark.cassandra.connection.host", CassandraConfig.host)
>   .set("spark.logConf", "true")
>   .set("spark.driver.port","7001")
>   .set("spark.driver.host","192.168.33.10")
>   .set("spark.fileserver.port","6002")
>   .set("spark.broadcast.port","6003")
>   .set("spark.replClassServer.port","6004")
>   .set("spark.blockManager.port","6005")
>   .set("spark.executor.port","6006")
>   
> .set("spark.broadcast.factory","org.apache.spark.broadcast.HttpBroadcastFactory")
>   .setJars(sparkJars)
>
> Netstat of my stream app
>
> tcp6   0  0 :::6002 :::*LISTEN
>  9314/java
> tcp6   0  0 :::6003 :::*LISTEN
>  9314/java
> tcp6   0  0 :::6005 :::*LISTEN
>  9314/java
> tcp6   0  0 192.168.33.10:7001  :::*
>  LISTEN  9314/java
> tcp6   0  0 192.168.33.10:7002  :::*
>  LISTEN  9314/java
> tcp6   0  0 :::4040 :::*LISTEN
>  9314/java
>
> netstat of the master running on docker
>
> Proto Recv-Q Send-Q Local Address   Foreign Address State
>   PID/Program name
> tcp6   0  0 172.18.0.3:7077 :::*
>  LISTEN  -
> tcp6   0  0 :::8080 :::*LISTEN
>  -
> tcp6   0  0 172.18.0.3:6066 :::*
>  LISTEN  -
>
> netstat of worker running on docker
>
> Proto Recv-Q Send-Q Local Address   Foreign Address State
>   PID/Program name
> tcp6   0  0 :::8081 :::*LISTEN
>  -
> tcp6   0  0 :::6005 :::*LISTEN
>  -
> tcp6   0  0 172.18.0.2:6006 :::*
>  LISTEN  -
> tcp6   0  0 172.18.0.2: :::*
>  LISTEN  -
>
>
> so far still no success
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Mon, Mar 14, 2016 at 11:14 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Could you use netstat to show the ports that the driver is listening?
>>
>> On Mon, Mar 14, 2016 at 1:45 PM, David Gomez Saavedra 
>> wrote:
>>
>>> hi everyone,
>>>
>>> I'm trying to set up spark streaming using akka with a similar example
>>> of the word count provided. When using spark master in local mode
>>> everything works but when I try to run it the driver and executors using
>>> docker I get the following exception
>>>
>>>
>>> 16/03/14 20:32:03 WARN NettyRpcEndpointRef: Error sending message [message 
>>> = Heartbeat(0,[Lscala.Tuple2;@5ad3f40c,BlockManagerId(0, 172.18.0.4, 
>>> 7005))] in 1 attempts
>>> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 10 
>>> seconds. This timeout is controlled by spark.executor.heartbeatInterval
>>> at 
>>> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>>> at 
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>>> at 
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>>> at 
>>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>> at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
>>> at scala.util.Try$.apply(Try.scala:192)
>>> at scala.util.Failure.recover(Try.scala:216)
>>> at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>>> at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>>> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>>> at 
>>> org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
>>> at 
>>> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
>>> at 
>>> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>>> at 
>>> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>>> at scala.concurrent.Promise$class.complete(Promise.scala:55)
>>> at 
>>> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
>>> at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>>> at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>>> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>>> at 
>>> 

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread manas kar
You are quite right. I am getting this error while profiling my module to
see what minimum resources I can use and still achieve my SLA.
My point is that if a resource constraint creates this problem, then this
issue is just waiting to happen in a larger scenario (though the probability
of it happening will be lower).

I hope to get some guidance as to what parameter I can use in order to
totally avoid this issue.
I am guessing spark.shuffle.io.preferDirectBufs = false but I am not sure.

..Manas

On Tue, Mar 15, 2016 at 2:30 PM, Iain Cundy  wrote:

> Hi Manas
>
>
>
> I saw a very similar problem while using mapWithState. Timeout on
> BlockManager remove leading to a stall.
>
>
>
> In my case it only occurred when there was a big backlog of micro-batches,
> combined with a shortage of memory. The adding and removing of blocks
> between new and old tasks was interleaved.  Don’t really know what caused
> it. Once I fixed the problems that were causing the backlog – in my case
> state compaction not working with Kryo in 1.6.0 (with Kryo workaround
> rather than patch) – I’ve never seen it again.
>
>
>
> So if you’ve got a backlog or other issue to fix maybe you’ll get lucky
> too J.
>
>
>
> Cheers
>
> Iain
>
>
>
> *From:* manas kar [mailto:poorinsp...@gmail.com]
> *Sent:* 15 March 2016 14:49
> *To:* Ted Yu
> *Cc:* user
> *Subject:* [MARKETING] Re: mapwithstate Hangs with Error cleaning
> broadcast
>
>
>
> I am using spark 1.6.
>
> I am not using any broadcast variable.
>
> This broadcast variable is probably used by the state management of
> mapwithState
>
>
>
> ...Manas
>
>
>
> On Tue, Mar 15, 2016 at 10:40 AM, Ted Yu  wrote:
>
> Which version of Spark are you using ?
>
>
>
> Can you show the code snippet w.r.t. broadcast variable ?
>
>
>
> Thanks
>
>
>
> On Tue, Mar 15, 2016 at 6:04 AM, manasdebashiskar 
> wrote:
>
> Hi,
>  I have a streaming application that takes data from a kafka topic and uses
> mapwithstate.
>  After couple of hours of smooth running of the application I see a problem
> that seems to have stalled my application.
> The batch seems to have been stuck after the following error popped up.
> Has anyone seen this error or know what causes it?
> 14/03/2016 21:41:13,295 ERROR org.apache.spark.ContextCleaner: 95 - Error
> cleaning broadcast 7456
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at
> org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at
>
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at
>
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at
>
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
> at
>
> org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
> at
>
> org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
> at
>
> org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
> at
>
> org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:233)
> at
>
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:189)
> at
>
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:180)
> at scala.Option.foreach(Option.scala:236)
> at
>
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:180)
> at
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
> at
> org.apache.spark.ContextCleaner.org
> $apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:173)
> at
> org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [120 seconds]
> at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at
>
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> ... 12 more
>
>
>
>
> --
> View this message in context:

Re: Microsoft SQL dialect issues

2016-03-15 Thread Mich Talebzadeh
Hi,

Can you please clarify what you are trying to achieve? I guess you mean
Transact-SQL for MSSQL?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 March 2016 at 19:09, Andrés Ivaldi  wrote:

> Hello, I'm trying to use MSSQL, storing data on MSSQL but i'm having
> dialect problems
> I found this
>
> https://mail-archives.apache.org/mod_mbox/spark-issues/201510.mbox/%3cjira.12901078.1443461051000.34556.1444123886...@atlassian.jira%3E
>
> That is what is happening to me, It's possible to define the dialect? so I
> can override the default for SQLServer?
>
> Regards.
>
> --
> Ing. Ivaldi Andres
>


Microsoft SQL dialect issues

2016-03-15 Thread Andrés Ivaldi
Hello, I'm trying to use MSSQL, storing data in MSSQL, but I'm having
dialect problems.
I found this:
https://mail-archives.apache.org/mod_mbox/spark-issues/201510.mbox/%3cjira.12901078.1443461051000.34556.1444123886...@atlassian.jira%3E

That is what is happening to me. Is it possible to define the dialect, so I
can override the default for SQL Server?

Regards.

-- 
Ing. Ivaldi Andres


How to select from table name using IF(condition, tableA, tableB)?

2016-03-15 Thread Rex X
I want to do a query that uses a logical condition to choose between two tables.

select *
from if(A>B, tableA, tableB)


But "if" function in Hive cannot work within FROM above. Any idea how?


Re: newbie HDFS S3 best practices

2016-03-15 Thread Frank Austin Nothaft
Hard to say with #1 without knowing your application’s characteristics; for #2, 
we use conductor  with IAM roles, 
.boto/.aws/credentials files.

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

> On Mar 15, 2016, at 11:45 AM, Andy Davidson  
> wrote:
> 
> We use the spark-ec2 script to create AWS clusters as needed (we do not use 
> AWS EMR)
> will we get better performance if we copy data to HDFS before we run instead 
> of reading directly from S3?
>  2. What is a good way to move results from HDFS to S3?
> 
> 
> It seems like there are many ways to bulk copy to s3. Many of them require we 
> explicitly use the AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@ 
> . This seems like a 
> bad idea? 
> 
> What would you recommend?
> 
> Thanks
> 
> Andy
> 
> 



Re: bug? using withColumn with colName with dot can't replace column

2016-03-15 Thread Jan Štěrba
First off, I would advise against having dots in column names; that's
just playing with fire.

Second, the exception is really strange, since Spark is complaining
about a completely unrelated column. I would like to see the df schema
from before the exception was thrown.
--
Jan Sterba
https://twitter.com/honzasterba | http://flickr.com/honzasterba |
http://500px.com/honzasterba
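
Two possible workarounds, as a rough sketch (untested against this exact schema): rename the column up front so the dot never reaches the parser, or select the backticked column with an alias.

// withColumnRenamed matches the literal field name, so the dot is not parsed
val renamed = df.withColumnRenamed("raw.hourOfDay", "raw_hourOfDay")

// or keep the backticks and alias the result
val selected = df.select(df.col("`raw.hourOfDay`").alias("raw_hourOfDay"))
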


On Tue, Mar 15, 2016 at 6:51 PM, Emmanuel  wrote:
>
> In Spark 1.6
>
> if I do (column name has dot in it, but is not a nested column):
>
> df = df.withColumn("raw.hourOfDay", df.col("`raw.hourOfDay`"))
>
>
> scala> df = df.withColumn("raw.hourOfDay", df.col("`raw.hourOfDay`"))
> org.apache.spark.sql.AnalysisException: cannot resolve 'raw.minOfDay' given
> input columns raw.hourOfDay_2, raw.dayOfWeek, raw.sensor2, raw.hourOfDay,
> raw.minOfDay;
> at
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
> at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
> at
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
> at
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
> at
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
> at
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
> at
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>
>
> but if I do:
>
> df = df.withColumn("raw.hourOfDay_2", df.col("`raw.hourOfDay`"))
>
> scala> df.printSchema
> root
>  |-- raw.hourOfDay: long (nullable = true)
>  |-- raw.minOfDay: long (nullable = true)
>  |-- raw.dayOfWeek: long (nullable = true)
>  |-- raw.sensor2: long (nullable = true)
>  |-- raw.hourOfDay_2: long (nullable = true)
>
>
> it works fine (i.e. column is created).
>
> The only difference is that the name "raw.hourOfDay_2" does not exist yet,
> and is properly created as a colName with dot, not as a nested column.
>
> The documentation however says that if the column exists it will replace it,
> but it seems there is a miss-interpretation of the column name as a nested
> column
>
>
> defwithColumn(colName: String, col: Column): DataFrame
>
> Returns a new DataFrame by adding a column or replacing the existing column
> that has the same name.
>
>
>
>
> Any thoughts on why the different behavior when the column exists?
>
>
> Thanks
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
We use the spark-ec2 script to create AWS clusters as needed (we do not use
AWS EMR)
1. Will we get better performance if we copy data to HDFS before we run,
instead of reading directly from S3?
2. What is a good way to move results from HDFS to S3?


It seems like there are many ways to bulk copy to S3. Many of them require
us to explicitly put AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@
in the URL. This seems like a bad idea.

What would you recommend?

Thanks

Andy
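
One way to keep the keys out of the URL (a sketch; the bucket, path, and RDD name are placeholders, and the s3n scheme is assumed): pass the credentials through the Hadoop configuration and write results straight to S3.

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

// no credentials embedded in the path
results.saveAsTextFile("s3n://my-bucket/output/run-2016-03-15")
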






Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Suniti Singh
The data in the title column differs, so correcting the data in the column
requires finding out what the correct data is and then replacing it.

Finding the correct data could be tedious, but if some mechanism is in place
that can group the partially matched data, it might help with the further
processing.

I am kind of stuck.



On Tue, Mar 15, 2016 at 10:50 AM, Suniti Singh 
wrote:

> Is it always the case that one title is a substring of another ? -- Not
> always. One title can have values like D.O.C, doctor_{areacode},
> doc_{dep,areacode}
>
> On Mon, Mar 14, 2016 at 10:39 PM, Wail Alkowaileet 
> wrote:
>
>> I think you need some sort of fuzzy join ?
>> Is it always the case that one title is a substring of another ?
>>
>> On Tue, Mar 15, 2016 at 6:46 AM, Suniti Singh 
>> wrote:
>>
>>> Hi All,
>>>
>>> I have two tables with same schema but different data. I have to join
>>> the tables based on one column and then do a group by the same column name.
>>>
>>> now the data in that column in two table might/might not exactly match.
>>> (Ex - column name is "title". Table1. title = "doctor"   and Table2. title
>>> = "doc") doctor and doc are actually same titles.
>>>
>>> From performance point of view where i have data volume in TB , i am not
>>> sure if i can achieve this using the sql statement. What would be the best
>>> approach of solving this problem. Should i look for MLLIB apis?
>>>
>>> Spark Gurus any pointers?
>>>
>>> Thanks,
>>> Suniti
>>>
>>>
>>>
>>
>>
>> --
>>
>> *Regards,*
>> Wail Alkowaileet
>>
>
>
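
One low-tech option for the partial matching, sketched below and assuming Spark
1.5+ with a sqlContext in scope (the threshold of 3 and the table/column names are
illustrative only): use the built-in levenshtein() expression as a fuzzy join
condition. Note that this is effectively a cross join, so at TB scale you would
want to block the data first (for example on a normalized prefix of the title)
before applying it.

import org.apache.spark.sql.functions.{levenshtein, lower}

val t1 = sqlContext.table("table1")
val t2 = sqlContext.table("table2")

// Keep candidate pairs whose titles are within a small edit distance of each other.
val matched = t1.join(t2, levenshtein(lower(t1("title")), lower(t2("title"))) <= 3)
matched.show()
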


RE: [MARKETING] Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread Iain Cundy
Hi Manas

I saw a very similar problem while using mapWithState. Timeout on BlockManager 
remove leading to a stall.

In my case it only occurred when there was a big backlog of micro-batches, 
combined with a shortage of memory. The adding and removing of blocks between 
new and old tasks was interleaved. I don’t really know what caused it. Once I 
fixed the problems that were causing the backlog – in my case state compaction 
not working with Kryo in 1.6.0 (with the Kryo workaround rather than the patch) – I’ve 
never seen it again.

So if you’ve got a backlog or other issue to fix maybe you’ll get lucky too ☺.

Cheers
Iain
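
For anyone who cannot clear the backlog right away, two configuration knobs may
keep the job alive while investigating (a sketch only; the timeout value is
arbitrary, and the second setting is a general suggestion on my part rather than
something verified against this particular stall):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("mapwithstate-app")
  // The stack trace says the 120 s limit comes from spark.rpc.askTimeout.
  .set("spark.rpc.askTimeout", "300s")
  // Possible mitigation (not from this thread): let the ContextCleaner
  // fire-and-forget block/broadcast removal instead of blocking on it.
  .set("spark.cleaner.referenceTracking.blocking", "false")
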

From: manas kar [mailto:poorinsp...@gmail.com]
Sent: 15 March 2016 14:49
To: Ted Yu
Cc: user
Subject: [MARKETING] Re: mapwithstate Hangs with Error cleaning broadcast

I am using spark 1.6.
I am not using any broadcast variable.
This broadcast variable is probably used by the state management of mapwithState

...Manas

On Tue, Mar 15, 2016 at 10:40 AM, Ted Yu 
> wrote:
Which version of Spark are you using ?

Can you show the code snippet w.r.t. broadcast variable ?

Thanks

On Tue, Mar 15, 2016 at 6:04 AM, manasdebashiskar 
> wrote:
Hi,
 I have a streaming application that takes data from a kafka topic and uses
mapwithstate.
 After couple of hours of smooth running of the application I see a problem
that seems to have stalled my application.
The batch seems to have been stuck after the following error popped up.
Has anyone seen this error or know what causes it?
14/03/2016 21:41:13,295 ERROR org.apache.spark.ContextCleaner: 95 - Error
cleaning broadcast 7456
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
seconds]. This timeout is controlled by spark.rpc.askTimeout
at
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at
org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
at
org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
at
org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at
org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
at
org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:233)
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:189)
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:180)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:180)
at
org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
at
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:173)
at
org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after
[120 seconds]
at
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 12 more




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/mapwithstate-Hangs-with-Error-cleaning-broadcast-tp26500.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org



This message and the information contained herein is proprietary and 
confidential and subject to the Amdocs policy statement,
you may review at http://www.amdocs.com/email_disclaimer.asp


Best way to process values for key in sorted order

2016-03-15 Thread James Hammerton
Hi,

I need to process some events in a specific order based on a timestamp, for
each user in my data.

I had implemented this by using the dataframe sort method to sort by user
id and then sort by the timestamp secondarily, then do a
groupBy().mapValues() to process the events for each user.

However, on re-reading the docs I see that groupByKey() does not guarantee
any ordering of the values, yet my code (which will fall over on out-of-order
events) seems to run OK so far, in local mode but on a machine
with 8 CPUs.

I guess the easiest way to be certain would be to sort the values after the
groupByKey, but I'm wondering if using mapPartitions() to process all
entries in a partition would work, given I had pre-ordered the data?

This would require a bit more work to track when I switch from one user to
the next as I process the events, but if the original order has been
preserved on reading the events in, this should work.

Anyone know definitively if this is the case?

Regards,

James
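
The safe way to remove the doubt is to make the per-user ordering explicit after
the grouping. A minimal, self-contained sketch (the Event fields are made up):

import org.apache.spark.{SparkConf, SparkContext}

case class Event(userId: String, timestamp: Long, payload: String)

object SortedEventsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sorted-events").setMaster("local[*]"))

    // Toy input; in the real job this would come from the pre-sorted DataFrame.
    val events = sc.parallelize(Seq(
      Event("u1", 2L, "b"), Event("u1", 1L, "a"), Event("u2", 5L, "x")))

    // groupByKey gives no ordering guarantee, so sort each user's events explicitly.
    val perUser = events
      .map(e => (e.userId, e))
      .groupByKey()
      .mapValues(_.toSeq.sortBy(_.timestamp))

    perUser.collect().foreach(println)
    sc.stop()
  }
}

For very large per-user groups, an alternative is to key by (userId, timestamp) and
use repartitionAndSortWithinPartitions with a partitioner that hashes on userId
only, then walk each partition with mapPartitions; that avoids materializing a
whole user's events in memory at once.
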


Re: Partition RDD by key to create DataFrames

2016-03-15 Thread Davies Liu
I think you could create a DataFrame with schema (mykey, value1,
value2), then partition it by mykey when saving as parquet.

val r2 = rdd.map { case (k, v) => Row(k, v._1, v._2) }
val df = sqlContext.createDataFrame(r2, schema)
df.write.partitionBy("mykey").parquet(path)
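
A slightly fuller, self-contained sketch of the same idea, assuming sc and
sqlContext are in scope as in the shell (the column names and output path are
illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val rdd = sc.parallelize(Seq(("mykey1", ("a", "b")), ("mykey2", ("c", "d"))))

val schema = StructType(Seq(
  StructField("mykey", StringType, nullable = false),
  StructField("V1", StringType, nullable = true),
  StructField("V2", StringType, nullable = true)))

val rows = rdd.map { case (k, (v1, v2)) => Row(k, v1, v2) }
val df = sqlContext.createDataFrame(rows, schema)

// Produces one directory per key value: .../mykey=mykey1/part-*.parquet, etc.
df.write.partitionBy("mykey").parquet("/tmp/mykey_parquet")
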


On Tue, Mar 15, 2016 at 10:33 AM, Mohamed Nadjib MAMI
 wrote:
> Hi,
>
> I have a pair RDD of the form: (mykey, (value1, value2))
>
> How can I create a DataFrame having the schema [V1 String, V2 String] to
> store [value1, value2] and save it into a Parquet table named "mykey"?
>
> createDataFrame() method takes an RDD and a schema (StructType) in
> parameters. The schema is known up front ([V1 String, V2 String]), but
> getting an RDD by partitioning the original RDD based on the key is what I
> can't get my head around so far.
>
> Similar questions have been around (like
> http://stackoverflow.com/questions/25046199/apache-spark-splitting-pair-rdd-into-multiple-rdds-by-key-to-save-values)
> but they do not use DataFrames.
>
> Thanks in advance!
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



bug? using withColumn with colName with dot can't replace column

2016-03-15 Thread Emmanuel

In Spark 1.6
if I do (column name has dot in it, but is not a nested column):
df = df.withColumn("raw.hourOfDay", df.col("`raw.hourOfDay`"))scala> df = 
df.withColumn("raw.hourOfDay", 
df.col("`raw.hourOfDay`"))org.apache.spark.sql.AnalysisException: cannot 
resolve 'raw.minOfDay' given input columns raw.hourOfDay_2, raw.dayOfWeek, 
raw.sensor2, raw.hourOfDay, raw.minOfDay;at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)at 
scala.collection.AbstractTraversable.map(Traversable.scala:105)at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
but if I do:

df = df.withColumn("raw.hourOfDay_2", df.col("`raw.hourOfDay`"))

scala> df.printSchema
root
 |-- raw.hourOfDay: long (nullable = true)
 |-- raw.minOfDay: long (nullable = true)
 |-- raw.dayOfWeek: long (nullable = true)
 |-- raw.sensor2: long (nullable = true)
 |-- raw.hourOfDay_2: long (nullable = true)
it works fine (i.e. column is created).
The only difference is that the name "raw.hourOfDay_2" does not exist yet, and 
is properly created as a colName with dot, not as a nested column.
The documentation however says that if the column exists it will replace it,
but it seems there is a misinterpretation of the column name as a nested
column:

def withColumn(colName: String, col: Column): DataFrame

Returns a new DataFrame by adding a column or replacing the existing column
that has the same name.


Any thoughts on why the different behavior when the column exists?

Thanks
  

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Suniti Singh
Is it always the case that one title is a substring of another ? -- Not
always. One title can have values like D.O.C, doctor_{areacode},
doc_{dep,areacode}

On Mon, Mar 14, 2016 at 10:39 PM, Wail Alkowaileet 
wrote:

> I think you need some sort of fuzzy join ?
> Is it always the case that one title is a substring of another ?
>
> On Tue, Mar 15, 2016 at 6:46 AM, Suniti Singh 
> wrote:
>
>> Hi All,
>>
>> I have two tables with same schema but different data. I have to join the
>> tables based on one column and then do a group by the same column name.
>>
>> now the data in that column in two table might/might not exactly match.
>> (Ex - column name is "title". Table1. title = "doctor"   and Table2. title
>> = "doc") doctor and doc are actually same titles.
>>
>> From performance point of view where i have data volume in TB , i am not
>> sure if i can achieve this using the sql statement. What would be the best
>> approach of solving this problem. Should i look for MLLIB apis?
>>
>> Spark Gurus any pointers?
>>
>> Thanks,
>> Suniti
>>
>>
>>
>
>
> --
>
> *Regards,*
> Wail Alkowaileet
>


Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread Sea
Hi, manas:
 Maybe you can look at this bug: 
https://issues.apache.org/jira/browse/SPARK-13566






------ Original Message ------
From: "manas kar"
Date: 2016-03-15 (Tue) 10:48
To: "Ted Yu"
Cc: "user"
Subject: Re: mapwithstate Hangs with Error cleaning broadcast



I am using spark 1.6.I am not using any broadcast variable.
This broadcast variable is probably used by the state management of mapwithState


...Manas


On Tue, Mar 15, 2016 at 10:40 AM, Ted Yu  wrote:
Which version of Spark are you using ?

Can you show the code snippet w.r.t. broadcast variable ?


Thanks


On Tue, Mar 15, 2016 at 6:04 AM, manasdebashiskar  wrote:
Hi,
  I have a streaming application that takes data from a kafka topic and uses
 mapwithstate.
  After couple of hours of smooth running of the application I see a problem
 that seems to have stalled my application.
 The batch seems to have been stuck after the following error popped up.
 Has anyone seen this error or know what causes it?
 14/03/2016 21:41:13,295 ERROR org.apache.spark.ContextCleaner: 95 - Error
 cleaning broadcast 7456
 org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
 seconds]. This timeout is controlled by spark.rpc.askTimeout
 at
 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
 at
 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
 at
 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
 at
 scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
 at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
 at
 
org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
 at
 
org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
 at
 
org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
 at
 
org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
 at
 org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:233)
 at
 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:189)
 at
 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:180)
 at scala.Option.foreach(Option.scala:236)
 at
 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:180)
 at
 org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
 at
 
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:173)
 at
 org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after
 [120 seconds]
 at
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at
 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
 ... 12 more
 
 
 
 
 --
 View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/mapwithstate-Hangs-with-Error-cleaning-broadcast-tp26500.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

Partition RDD by key to create DataFrames

2016-03-15 Thread Mohamed Nadjib MAMI

Hi,

I have a pair RDD of the form: (mykey, (value1, value2))

How can I create a DataFrame having the schema [V1 String, V2 String] to 
store [value1, value2] and save it into a Parquet table named "mykey"?


The createDataFrame() method takes an RDD and a schema (StructType) as 
parameters. The schema is known up front ([V1 String, V2 String]), but 
getting an RDD by partitioning the original RDD based on the key is what 
I can't get my head around so far.


Similar questions have been around (like 
http://stackoverflow.com/questions/25046199/apache-spark-splitting-pair-rdd-into-multiple-rdds-by-key-to-save-values) 
but they do not use DataFrames.


Thanks in advance!



Questions about Spark On Mesos

2016-03-15 Thread Shuai Lin
Hi list,

We (scrapinghub) are planning to deploy spark in a 10+ node cluster, mainly
for processing data in HDFS and kafka streaming. We are thinking of using
mesos instead of yarn as the cluster resource manager so we can use docker
containers as the executors and make deployment easier. But there is one
important thing to settle before making the decision: data locality.

If we run spark on mesos, can it achieve good data locality when processing
HDFS data? I think spark on yarn can achieve that out of the box, but not
sure whether spark on mesos could do that.

I've searched through the archive of the list, but didn't find a helpful
answer yet. Any reply is appreciated.

Regards,
Shuai
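
Not an answer to the locality question itself, but for reference, a sketch of the
settings usually involved when tuning this on Mesos (the ZooKeeper URI is a
placeholder; whether reads end up node-local ultimately depends on where Mesos
places the executors relative to the HDFS blocks):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("hdfs-on-mesos-sketch")
  .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")  // placeholder master URI
  .set("spark.mesos.coarse", "true")  // long-lived executors instead of per-task launches
  .set("spark.locality.wait", "3s")   // how long to wait for a node-local slot before falling back
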


Re: create hive context in spark application

2016-03-15 Thread Antonio Si
Thanks Akhil.

Yes, spark-shell works fine.

In my app, I have a RESTful service, and from that service I am
calling the Spark API to run some HiveQL. That's why
I am not using spark-submit.

Thanks.

Antonio.
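
In case it helps, a rough sketch (untested) of configuring the metastore
programmatically when hive-site.xml is not on the application's classpath. The
thrift URI below is a placeholder, and it assumes a running metastore service
rather than the direct-JDBC Oracle setup; for the latter you would set the
javax.jdo.option.* properties instead, or simply add the directory containing
hive-site.xml to the REST service's classpath.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("rest-hive-sketch").setMaster("local[*]"))
val hiveContext = new HiveContext(sc)

// What hive-site.xml would normally provide:
hiveContext.setConf("hive.metastore.uris", "thrift://metastore-host:9083")

hiveContext.sql("SHOW DATABASES").show()
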

On Tue, Mar 15, 2016 at 12:02 AM, Akhil Das 
wrote:

> Did you try submitting your application with spark-submit? You
> can also try opening a spark-shell and see if it picks up your
> hive-site.xml.
>
> Thanks
> Best Regards
>
> On Tue, Mar 15, 2016 at 11:58 AM, antoniosi  wrote:
>
>> Hi,
>>
>> I am trying to connect to a hive metastore deployed in a oracle db. I have
>> the hive configuration
>> specified in the hive-site.xml. I put the hive-site.xml under
>> $SPARK_HOME/conf. If I run spark-shell,
>> everything works fine. I can create hive database, tables and query the
>> tables.
>>
>> However, when I try to do that in a spark application, running in local
>> mode, i.e., I have
>> sparkConf.setMaster("local[*]").setSparkHome(> installation>),
>> it does not seem
>> to pick up the hive-site.xml. It still uses the local derby Hive metastore
>> instead of the oracle
>> metastore that I defined in hive-site.xml. If I add the hive-site.xml
>> explicitly on the classpath, I am
>> getting the following error:
>>
>> Caused by:
>> org.datanucleus.api.jdo.exceptions.TransactionNotActiveException:
>> Transaction is not active. You either need to define a transaction around
>> this, or run your PersistenceManagerFactory with 'NontransactionalRead'
>> and
>> 'NontransactionalWrite' set to 'true'
>> FailedObject:org.datanucleus.exceptions.TransactionNotActiveException:
>> Transaction is not active. You either need to define a transaction around
>> this, or run your PersistenceManagerFactory with 'NontransactionalRead'
>> and
>> 'NontransactionalWrite' set to 'true'
>> at
>>
>> org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:396)
>> at
>> org.datanucleus.api.jdo.JDOTransaction.rollback(JDOTransaction.java:186)
>> at
>>
>> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.runTestQuery(MetaStoreDirectSql.java:204)
>> at
>>
>> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.(MetaStoreDirectSql.java:137)
>> at
>>
>> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:295)
>> at
>> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>> at
>> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>> at
>>
>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>> at
>>
>> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57)
>> at
>>
>> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>> at
>>
>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
>> at
>>
>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
>> at
>>
>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:624)
>> at
>>
>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
>> at
>>
>> org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:66)
>> at
>>
>> org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
>> at
>>
>> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
>> at
>>
>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:199)
>> at
>>
>> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
>>
>> This happens when I try to new a HiveContext in my code.
>>
>> How do I ask Spark to look at the hive-site.xml in the $SPARK_HOME/conf
>> directory in my spark application?
>>
>> Thanks very much. Any pointer will be much appreciated.
>>
>> Regards,
>>
>> Antonio.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/create-hive-context-in-spark-application-tp26496.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: sparkR issues ?

2016-03-15 Thread Alex Kozlov
Hi Roni, you can probably rename the as.data.frame in
$SPARK_HOME/R/pkg/R/DataFrame.R and re-install SparkR by running
install-dev.sh

On Tue, Mar 15, 2016 at 8:46 AM, roni  wrote:

> Hi ,
>  Is there a work around for this?
>  Do i need to file a bug for this?
> Thanks
> -R
>
> On Tue, Mar 15, 2016 at 12:28 AM, Sun, Rui  wrote:
>
>> It seems as.data.frame() defined in SparkR covers the versions in R base
>> package.
>>
>> We can try to see if we can change the implementation of as.data.frame()
>> in SparkR to avoid such covering.
>>
>>
>>
>> *From:* Alex Kozlov [mailto:ale...@gmail.com]
>> *Sent:* Tuesday, March 15, 2016 2:59 PM
>> *To:* roni 
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: sparkR issues ?
>>
>>
>>
>> This seems to be a very unfortunate name collision.  SparkR defines its
>> own DataFrame class which shadows what seems to be your own definition.
>>
>>
>>
>> Is DataFrame something you define?  Can you rename it?
>>
>>
>>
>> On Mon, Mar 14, 2016 at 10:44 PM, roni  wrote:
>>
>> Hi,
>>
>>  I am working with bioinformatics and trying to convert some scripts to
>> sparkR to fit into other spark jobs.
>>
>>
>>
>> I tried a simple example from a bioinf lib and as soon as I start sparkR
>> environment it does not work.
>>
>>
>>
>> code as follows -
>>
>> countData <- matrix(1:100,ncol=4)
>>
>> condition <- factor(c("A","A","B","B"))
>>
>> dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~
>> condition)
>>
>>
>>
>> Works if i dont initialize the sparkR environment.
>>
>>  if I do library(SparkR) and sqlContext <- sparkRSQL.init(sc)  it gives
>> following error
>>
>>
>>
>> > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~
>> condition)
>>
>> Error in DataFrame(colData, row.names = rownames(colData)) :
>>
>>   cannot coerce class "data.frame" to a DataFrame
>>
>>
>>
>> I am really stumped. I am not using any spark function , so i would
>> expect it to work as a simple R code.
>>
>> why it does not work?
>>
>>
>>
>> Appreciate  the help
>>
>> -R
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Alex Kozlov
>> (408) 507-4987
>> (650) 887-2135 efax
>> ale...@gmail.com
>>
>
>


-- 
Alex Kozlov
(408) 507-4987
(650) 887-2135 efax
ale...@gmail.com


Re: Spark work distribution among execs

2016-03-15 Thread bkapukaranov
Hi,

This is an interesting point of view. I thought the HashPartitioner works
completely differently.
Here's my understanding - the HashPartitioner defines how keys are
distributed within a dataset between the different partitions, but plays no
role in assigning each partition for processing by executors.
I may be wrong so please let me know if that's the case :)

In my case the partitions are even - so the dataset is distributed evenly
between partitions. It's just that they are processed very unevenly - 1-2
nodes handle many more partitions than the other cluster members.

Also note that the cluster is made of identical nodes in terms of HW, so it's
not like one of the nodes just "works" quicker.

Thanks,
Borislav
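
One thing worth trying when only 1-2 executors end up with most of the input
(a sketch; the paths and partition count below are made up): force a shuffle right
after the JSON read so the downstream work is spread evenly, at the cost of one
extra shuffle.

val logs = sqlContext.read.json("hdfs://remote-nn:8020/logs/2016/03/15/*")

logs.repartition(200)   // redistribute evenly across executors before the heavy work
    .write
    .parquet("hdfs:///warehouse/logs_parquet/2016-03-15")
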



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tp26502p26508.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Release Announcement: XGBoost4J - Portable Distributed XGBoost in Spark, Flink and Dataflow

2016-03-15 Thread Nan Zhu
Dear Spark Users and Developers, 

We (Distributed (Deep) Machine Learning Community (http://dmlc.ml/)) are happy 
to announce the release of XGBoost4J 
(http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html),
 a Portable Distributed XGBoost in Spark, Flink and Dataflow 

XGBoost is an optimized distributed gradient boosting library designed to be 
highly efficient, flexible and portable. XGBoost provides parallel tree 
boosting (also known as GBDT, GBM) that solves many data science problems in a 
fast and accurate way. It has been the winning solution for many machine 
learning scenarios, ranging from Machine Learning Challenges 
(https://github.com/dmlc/xgboost/tree/master/demo#machine-learning-challenge-winning-solutions)
 to Industrial User Cases 
(https://github.com/dmlc/xgboost/tree/master/demo#usecases) 

XGBoost4J is a new package in XGBoost aiming to provide clean Scala/Java 
APIs and seamless integration with mainstream data processing platforms, 
like Apache Spark. With XGBoost4J, users can run XGBoost as a stage of Spark 
job and build a unified pipeline from ETL to Model training to data product 
service within Spark, instead of jumping across two different systems, i.e. 
XGBoost and Spark. (Example: 
https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/DistTrainWithSpark.scala)

Today, we release the first version of XGBoost4J to bring more choices to the 
Spark users who are seeking the solutions to build highly efficient data 
analytic platform and enrich the Spark ecosystem. We will keep moving forward 
to integrate with more features of Spark. Of course, you are more than welcome 
to join us and contribute to the project!

For more details of distributed XGBoost, you can refer to the recently 
published paper: http://arxiv.org/abs/1603.02754

Best, 

-- 
Nan Zhu
http://codingcat.me



Re: sparkR issues ?

2016-03-15 Thread roni
Hi ,
 Is there a workaround for this?
 Do I need to file a bug for this?
Thanks
-R

On Tue, Mar 15, 2016 at 12:28 AM, Sun, Rui  wrote:

> It seems as.data.frame() defined in SparkR covers the versions in R base
> package.
>
> We can try to see if we can change the implementation of as.data.frame()
> in SparkR to avoid such covering.
>
>
>
> *From:* Alex Kozlov [mailto:ale...@gmail.com]
> *Sent:* Tuesday, March 15, 2016 2:59 PM
> *To:* roni 
> *Cc:* user@spark.apache.org
> *Subject:* Re: sparkR issues ?
>
>
>
> This seems to be a very unfortunate name collision.  SparkR defines its
> own DataFrame class which shadows what seems to be your own definition.
>
>
>
> Is DataFrame something you define?  Can you rename it?
>
>
>
> On Mon, Mar 14, 2016 at 10:44 PM, roni  wrote:
>
> Hi,
>
>  I am working with bioinformatics and trying to convert some scripts to
> sparkR to fit into other spark jobs.
>
>
>
> I tried a simple example from a bioinf lib and as soon as I start sparkR
> environment it does not work.
>
>
>
> code as follows -
>
> countData <- matrix(1:100,ncol=4)
>
> condition <- factor(c("A","A","B","B"))
>
> dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)
>
>
>
> Works if i dont initialize the sparkR environment.
>
>  if I do library(SparkR) and sqlContext <- sparkRSQL.init(sc)  it gives
> following error
>
>
>
> > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~
> condition)
>
> Error in DataFrame(colData, row.names = rownames(colData)) :
>
>   cannot coerce class "data.frame" to a DataFrame
>
>
>
> I am really stumped. I am not using any spark function , so i would expect
> it to work as a simple R code.
>
> why it does not work?
>
>
>
> Appreciate  the help
>
> -R
>
>
>
>
>
>
>
> --
>
> Alex Kozlov
> (408) 507-4987
> (650) 887-2135 efax
> ale...@gmail.com
>


Re: sparkR issues ?

2016-03-15 Thread roni
Alex,
 No, I have not defined the "dataframe"; it's the Spark default DataFrame. That
line is just casting the factor as a data frame to send to the function.
Thanks
-R

On Mon, Mar 14, 2016 at 11:58 PM, Alex Kozlov  wrote:

> This seems to be a very unfortunate name collision.  SparkR defines its
> own DataFrame class which shadows what seems to be your own definition.
>
> Is DataFrame something you define?  Can you rename it?
>
> On Mon, Mar 14, 2016 at 10:44 PM, roni  wrote:
>
>> Hi,
>>  I am working with bioinformatics and trying to convert some scripts to
>> sparkR to fit into other spark jobs.
>>
>> I tried a simple example from a bioinf lib and as soon as I start sparkR
>> environment it does not work.
>>
>> code as follows -
>> countData <- matrix(1:100,ncol=4)
>> condition <- factor(c("A","A","B","B"))
>> dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~
>> condition)
>>
>> Works if i dont initialize the sparkR environment.
>>  if I do library(SparkR) and sqlContext <- sparkRSQL.init(sc)  it gives
>> following error
>>
>> > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~
>> condition)
>> Error in DataFrame(colData, row.names = rownames(colData)) :
>>   cannot coerce class "data.frame" to a DataFrame
>>
>> I am really stumped. I am not using any spark function , so i would
>> expect it to work as a simple R code.
>> why it does not work?
>>
>> Appreciate  the help
>> -R
>>
>>
>
>
> --
> Alex Kozlov
> (408) 507-4987
> (650) 887-2135 efax
> ale...@gmail.com
>


Re: reading file from S3

2016-03-15 Thread Gourav Sengupta
Once again, please use roles; there is no situation in which you have to specify the
access keys in the URI. Please read the Amazon
documentation and it will say the same. The only situation in which you use
the access keys in the URI is when you have not read the Amazon documentation :)

Regards,
Gourav
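
For reference, a sketch (untested) of the keyless and URI-less patterns: with an
IAM role and the s3a connector nothing credential-related needs to appear in the
code at all; if keys really are unavoidable, put them in the Hadoop configuration
rather than in the URI. Note that the older s3n connector generally needs explicit
keys, while s3a is usually what picks up instance-profile credentials.

// Preferred: instance role + s3a, no credentials anywhere in the code.
val data = sc.textFile("s3a://yasemindeneme/deneme.txt")

// Fallback when keys cannot be avoided: configuration, not the URI.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
val data2 = sc.textFile("s3n://yasemindeneme/deneme.txt")
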

On Tue, Mar 15, 2016 at 3:22 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> There are many solutions to a problem.
>
> Also understand that sometimes your situation might be such. For ex what
> if you are accessing S3 from your Spark job running in your continuous
> integration server sitting in your data center or may be a box under your
> desk. And sometimes you are just trying something.
>
> Also understand that sometimes you want answers to solve your problem at
> hand without redirecting you to something else. Understand what you
> suggested is an appropriate way of doing it, which I myself have proposed
> before, but that doesn't solve the OP's problem at hand.
>
> Regards
> Sab
> On 15-Mar-2016 8:27 pm, "Gourav Sengupta" 
> wrote:
>
>> Oh!!! What the hell
>>
>> Please never use the URI
>>
>> *s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY.*That is a major cause of
>> pain, security issues, code maintenance issues and ofcourse something that
>> Amazon strongly suggests that we do not use. Please use roles and you will
>> not have to worry about security.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Tue, Mar 15, 2016 at 2:38 PM, Sabarish Sasidharan <
>> sabarish@gmail.com> wrote:
>>
>>> You have a slash before the bucket name. It should be @.
>>>
>>> Regards
>>> Sab
>>> On 15-Mar-2016 4:03 pm, "Yasemin Kaya"  wrote:
>>>
 Hi,

 I am using Spark 1.6.0 standalone and I want to read a txt file from S3
 bucket named yasemindeneme and my file name is deneme.txt. But I am getting
 this error. Here is the simple code
 
 Exception in thread "main" java.lang.IllegalArgumentException: Invalid
 hostname in URI s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@
 /yasemindeneme/deneme.txt
 at
 org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:45)
 at
 org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)


 I try 2 options
 *sc.hadoopConfiguration() *and
 *sc.textFile("s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@/yasemindeneme/deneme.txt/");*

 Also I did export AWS_ACCESS_KEY_ID= .
  export AWS_SECRET_ACCESS_KEY=
 But there is no change about error.

 Could you please help me about this issue?


 --
 hiç ender hiç

>>>
>>


Re: Spark work distribution among execs

2016-03-15 Thread manasdebashiskar
Your input is skewed in terms of the default hash partitioner that is used.
Your options are to use a custom partitioner that can re-distribute the data
evenly among your executors.

I think you will see the same behaviour when you use more executors. It is
just that the data skew appears to be less. To prove the same, use an even
bigger input for your job.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tp26502p26506.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: reading file from S3

2016-03-15 Thread Sabarish Sasidharan
There are many solutions to a problem.

Also understand that sometimes your situation might demand it. For example, what if
you are accessing S3 from your Spark job running in your continuous
integration server sitting in your data center, or maybe on a box under your
desk. And sometimes you are just trying something out.

Also understand that sometimes you want answers to solve your problem at
hand without redirecting you to something else. Understand what you
suggested is an appropriate way of doing it, which I myself have proposed
before, but that doesn't solve the OP's problem at hand.

Regards
Sab
On 15-Mar-2016 8:27 pm, "Gourav Sengupta"  wrote:

> Oh!!! What the hell
>
> Please never use the URI
>
> *s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY.*That is a major cause of
> pain, security issues, code maintenance issues and ofcourse something that
> Amazon strongly suggests that we do not use. Please use roles and you will
> not have to worry about security.
>
> Regards,
> Gourav Sengupta
>
> On Tue, Mar 15, 2016 at 2:38 PM, Sabarish Sasidharan <
> sabarish@gmail.com> wrote:
>
>> You have a slash before the bucket name. It should be @.
>>
>> Regards
>> Sab
>> On 15-Mar-2016 4:03 pm, "Yasemin Kaya"  wrote:
>>
>>> Hi,
>>>
>>> I am using Spark 1.6.0 standalone and I want to read a txt file from S3
>>> bucket named yasemindeneme and my file name is deneme.txt. But I am getting
>>> this error. Here is the simple code
>>> 
>>> Exception in thread "main" java.lang.IllegalArgumentException: Invalid
>>> hostname in URI s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@
>>> /yasemindeneme/deneme.txt
>>> at
>>> org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:45)
>>> at
>>> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
>>>
>>>
>>> I try 2 options
>>> *sc.hadoopConfiguration() *and
>>> *sc.textFile("s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@/yasemindeneme/deneme.txt/");*
>>>
>>> Also I did export AWS_ACCESS_KEY_ID= .
>>>  export AWS_SECRET_ACCESS_KEY=
>>> But there is no change about error.
>>>
>>> Could you please help me about this issue?
>>>
>>>
>>> --
>>> hiç ender hiç
>>>
>>
>


Re: how to set log level of spark executor on YARN(using yarn-cluster mode)

2016-03-15 Thread jkukul
Hi Eric (or rather: anyone who's experiencing similar situation),

I think your problem was that the --files parameter was provided after
the application jar. Your command should have looked like this instead:

./bin/spark-submit --class edu.bjut.spark.SparkPageRank --master
yarn-cluster --num-executors 5 --executor-memory 2g
--executor-cores 1 --files log4j.properties
/data/hadoopspark/MySparkTest.jar hdfs://master:8000/srcdata/searchengine/*
5 5 hdfs://master:8000/resultdata/searchengine/2014102001/ 
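
For completeness, a minimal log4j.properties of the kind shipped via --files might
look like the sketch below (the logger names are only examples). Some setups also
add --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties
so the executors are certain to pick up the shipped file rather than the cluster
default.

# Example log4j.properties shipped to the executors (illustrative only)
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Quieter Spark internals, more detail from the application's own package
log4j.logger.org.apache.spark=WARN
log4j.logger.edu.bjut.spark=DEBUG
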




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p26505.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: reading file from S3

2016-03-15 Thread Gourav Sengupta
Oh!!! What the hell

Please never use the URI

s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY. That is a major cause of
pain, security issues, code maintenance issues and of course something that
Amazon strongly suggests that we do not use. Please use roles and you will
not have to worry about security.

Regards,
Gourav Sengupta

On Tue, Mar 15, 2016 at 2:38 PM, Sabarish Sasidharan  wrote:

> You have a slash before the bucket name. It should be @.
>
> Regards
> Sab
> On 15-Mar-2016 4:03 pm, "Yasemin Kaya"  wrote:
>
>> Hi,
>>
>> I am using Spark 1.6.0 standalone and I want to read a txt file from S3
>> bucket named yasemindeneme and my file name is deneme.txt. But I am getting
>> this error. Here is the simple code
>> 
>> Exception in thread "main" java.lang.IllegalArgumentException: Invalid
>> hostname in URI s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@
>> /yasemindeneme/deneme.txt
>> at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:45)
>> at
>> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
>>
>>
>> I try 2 options
>> *sc.hadoopConfiguration() *and
>> *sc.textFile("s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@/yasemindeneme/deneme.txt/");*
>>
>> Also I did export AWS_ACCESS_KEY_ID= .
>>  export AWS_SECRET_ACCESS_KEY=
>> But there is no change about error.
>>
>> Could you please help me about this issue?
>>
>>
>> --
>> hiç ender hiç
>>
>


Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread manas kar
I am using spark 1.6.
I am not using any broadcast variable.
This broadcast variable is probably used by the state management of
mapwithState

...Manas

On Tue, Mar 15, 2016 at 10:40 AM, Ted Yu  wrote:

> Which version of Spark are you using ?
>
> Can you show the code snippet w.r.t. broadcast variable ?
>
> Thanks
>
> On Tue, Mar 15, 2016 at 6:04 AM, manasdebashiskar 
> wrote:
>
>> Hi,
>>  I have a streaming application that takes data from a kafka topic and
>> uses
>> mapwithstate.
>>  After couple of hours of smooth running of the application I see a
>> problem
>> that seems to have stalled my application.
>> The batch seems to have been stuck after the following error popped up.
>> Has anyone seen this error or know what causes it?
>> 14/03/2016 21:41:13,295 ERROR org.apache.spark.ContextCleaner: 95 - Error
>> cleaning broadcast 7456
>> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
>> seconds]. This timeout is controlled by spark.rpc.askTimeout
>> at
>> org.apache.spark.rpc.RpcTimeout.org
>> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>> at
>>
>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>> at
>>
>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>> at
>>
>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>> at
>> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
>> at
>>
>> org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
>> at
>>
>> org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
>> at
>>
>> org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
>> at
>>
>> org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
>> at
>>
>> org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:233)
>> at
>>
>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:189)
>> at
>>
>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:180)
>> at scala.Option.foreach(Option.scala:236)
>> at
>>
>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:180)
>> at
>> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
>> at
>> org.apache.spark.ContextCleaner.org
>> $apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:173)
>> at
>> org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
>> [120 seconds]
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>> at
>> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>> at
>>
>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>> at scala.concurrent.Await$.result(package.scala:107)
>> at
>> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>> ... 12 more
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/mapwithstate-Hangs-with-Error-cleaning-broadcast-tp26500.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Spark work distribution among execs

2016-03-15 Thread bkapukaranov
Hi,

Yes, I'm running the executors with 8 cores each. I also have properly
configured executor memory, driver memory, num execs and so on in submit
cmd.
I'm a long-time Spark user, so please let's skip the dummy cmd configuration
stuff and dive into the interesting stuff :)

Another strange thing I've noticed is that this behaviour seems to be
exposed only for reading jsons:
- reading json from remote hdfs -> uneven executor performance
- reading parquets from remote hdfs -> even executor performance

>>What do you mean by the difference between the nodes is huge ?
When I look at the Input column in the Executors tab of the Spark WebUI the
values for the nodes that do the work vs all others is huge. For example in
the image below the diff is x4, but sometimes I've seen in x10 in the same
usecase.
 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tp26502p26504.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread Ted Yu
Which version of Spark are you using ?

Can you show the code snippet w.r.t. broadcast variable ?

Thanks

On Tue, Mar 15, 2016 at 6:04 AM, manasdebashiskar 
wrote:

> Hi,
>  I have a streaming application that takes data from a kafka topic and uses
> mapwithstate.
>  After couple of hours of smooth running of the application I see a problem
> that seems to have stalled my application.
> The batch seems to have been stuck after the following error popped up.
> Has anyone seen this error or know what causes it?
> 14/03/2016 21:41:13,295 ERROR org.apache.spark.ContextCleaner: 95 - Error
> cleaning broadcast 7456
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at
> org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at
>
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at
>
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at
>
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
> at
>
> org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
> at
>
> org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
> at
>
> org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
> at
>
> org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:233)
> at
>
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:189)
> at
>
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:180)
> at scala.Option.foreach(Option.scala:236)
> at
>
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:180)
> at
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
> at
> org.apache.spark.ContextCleaner.org
> $apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:173)
> at
> org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [120 seconds]
> at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at
>
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> ... 12 more
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/mapwithstate-Hangs-with-Error-cleaning-broadcast-tp26500.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: reading file from S3

2016-03-15 Thread Sabarish Sasidharan
You have a slash before the bucket name. It should be @.

Regards
Sab
On 15-Mar-2016 4:03 pm, "Yasemin Kaya"  wrote:

> Hi,
>
> I am using Spark 1.6.0 standalone and I want to read a txt file from S3
> bucket named yasemindeneme and my file name is deneme.txt. But I am getting
> this error. Here is the simple code
> 
> Exception in thread "main" java.lang.IllegalArgumentException: Invalid
> hostname in URI s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@
> /yasemindeneme/deneme.txt
> at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:45)
> at
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
>
>
> I try 2 options
> *sc.hadoopConfiguration() *and
> *sc.textFile("s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@/yasemindeneme/deneme.txt/");*
>
> Also I did export AWS_ACCESS_KEY_ID= .
>  export AWS_SECRET_ACCESS_KEY=
> But there is no change about error.
>
> Could you please help me about this issue?
>
>
> --
> hiç ender hiç
>


Re: Spark work distribution among execs

2016-03-15 Thread Chitturi Padma
By default Spark uses 2 executors with one core each. Have you allocated
more executors using the command line args, e.g.
--num-executors 25 --executor-cores x ?

What do you mean by the difference between the nodes is huge ?

Regards,
Padma Ch

On Tue, Mar 15, 2016 at 6:57 PM, bkapukaranov [via Apache Spark User List] <
ml-node+s1001560n2650...@n3.nabble.com> wrote:

> Hi,
>
> I'm running a Spark 1.6.0 on YARN on a Hadoop 2.6.0 cluster.
> I observe a very strange issue.
> I run a simple job that reads about 1TB of json logs from a remote HDFS
> cluster and converts them to parquet, then saves them to the local HDFS of
> the Hadoop cluster.
>
> I run it with 25 executors with sufficient resources. However the strange
> thing is that the job only uses 2 executors to do most of the read work.
>
> For example when I go to the Executors' tab in the Spark UI and look at
> the "Input" column, the difference between the nodes is huge, sometimes 20G
> vs 120G.
>
> 1. What is the cause for this behaviour?
> 2. Any ideas how to achieve a more balanced performance?
>
> Thanks,
> Borislav
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tp26502.html
> To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here.
> 
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tp26502p26503.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: reading file from S3

2016-03-15 Thread Gourav Sengupta
Hi,

Try starting your clusters with roles, and you will not have to configure or
hard-code anything at all.

Let me know in case you need any help with this.


Regards,
Gourav Sengupta

On Tue, Mar 15, 2016 at 11:32 AM, Yasemin Kaya  wrote:

> Hi Safak,
>
> I changed the Keys but there is no change.
>
> Best,
> yasemin
>
>
> 2016-03-15 12:46 GMT+02:00 Şafak Serdar Kapçı :
>
>> Hello Yasemin,
>> Maybe your key id or access key has special chars like backslash or
>> something. You need to change it.
>> Best Regards,
>> Safak.
>>
>> 2016-03-15 12:33 GMT+02:00 Yasemin Kaya :
>>
>>> Hi,
>>>
>>> I am using Spark 1.6.0 standalone and I want to read a txt file from S3
>>> bucket named yasemindeneme and my file name is deneme.txt. But I am getting
>>> this error. Here is the simple code
>>> 
>>> Exception in thread "main" java.lang.IllegalArgumentException: Invalid
>>> hostname in URI s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@
>>> /yasemindeneme/deneme.txt
>>> at
>>> org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:45)
>>> at
>>> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
>>>
>>>
>>> I try 2 options
>>> *sc.hadoopConfiguration() *and
>>> *sc.textFile("s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@/yasemindeneme/deneme.txt/");*
>>>
>>> Also I did export AWS_ACCESS_KEY_ID= .
>>>  export AWS_SECRET_ACCESS_KEY=
>>> But there is no change about error.
>>>
>>> Could you please help me about this issue?
>>>
>>>
>>> --
>>> hiç ender hiç
>>>
>>
>>
>
>
> --
> hiç ender hiç
>


Re: Building Spark packages with SBT or Maven

2016-03-15 Thread Mich Talebzadeh
Thanks, the Maven source structure is identical to sbt's; I will just have
to replace the sbt build file with pom.xml.

I will use your pom.xml to start with.

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 March 2016 at 13:12, Chandeep Singh  wrote:

> You can build using maven from the command line as well.
>
> This layout should give you an idea and here are some resources -
> http://www.scala-lang.org/old/node/345
>
> project/
>pom.xml   -  Defines the project
>src/
>   main/
>   java/ - Contains all java code that will go in your final artifact.
>   See maven-compiler-plugin 
>  for details
>   scala/ - Contains all scala code that will go in your final 
> artifact.
>See maven-scala-plugin 
>  for details
>   resources/ - Contains all static files that should be available on 
> the classpath
>in the final artifact.  See maven-resources-plugin 
>  for details
>   webapp/ - Contains all content for a web application (jsps, css, 
> images, etc.)
> See maven-war-plugin 
>  for details
>  site/ - Contains all apt or xdoc files used to create a project website.
>  See maven-site-plugin 
>  for details
>  test/
>  java/ - Contains all java code used for testing.
>  See maven-compiler-plugin 
>  for details
>  scala/ - Contains all scala code used for testing.
>   See maven-scala-plugin 
>  for details
>  resources/ - Contains all static content that should be available on 
> the
>   classpath during testing.   See maven-resources-plugin 
>  for details
>
>
>
> On Mar 15, 2016, at 12:38 PM, Chandeep Singh  wrote:
>
> Do you have the Eclipse Maven plugin setup? http://www.eclipse.org/m2e/
>
> Once you have it setup, File -> New -> Other -> MavenProject -> Next /
> Finish. You’ll see a default POM.xml which you can modify / replace.
>
> 
>
> Here is some documentation that should help:
> http://scala-ide.org/docs/tutorials/m2eclipse/
>
> I’m using the same Eclipse build as you on my Mac. I mostly build a shaded
> JAR and SCP it to the cluster.
>
> On Mar 15, 2016, at 12:22 PM, Mich Talebzadeh 
> wrote:
>
> Great Chandeep. I also have Eclipse Scala IDE below
>
> scala IDE build of Eclipse SDK
> Build id: 4.3.0-vfinal-2015-12-01T15:55:22Z-Typesafe
>
> I am no expert on Eclipse, so if I create a project called ImportCSV, where do
> I need to put the pom file, or how do I reference it please? My Eclipse runs
> on a Linux host so it can access all the directories that the sbt project
> accesses? I also believe there will not be any need for external jar files
> in the build path?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 March 2016 at 12:15, Chandeep Singh  wrote:
>
>> Btw, just to add to the confusion ;) I use Maven as well since I moved
>> from Java to Scala but everyone I talk to has been recommending SBT for
>> Scala.
>>
>> I use the Eclipse Scala IDE to build. http://scala-ide.org/
>>
>> Here is my sample PoM. You can add dependancies based on your requirement.
>>
>> http://maven.apache.org/POM/4.0.0; xmlns:xsi="
>> http://www.w3.org/2001/XMLSchema-instance;
>> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>> http://maven.apache.org/maven-v4_0_0.xsd;>
>> 4.0.0
>> spark
>> 1.0
>> ${project.artifactId}
>>
>> 
>> 1.7
>> 1.7
>> UTF-8
>> 2.10.4
>> 2.15.2
>> 
>>
>> 
>> 
>> cloudera-repo-releases
>> https://repository.cloudera.com/artifactory/repo/
>> 
>> 
>>
>> 
>> 
>> org.scala-lang
>> scala-library
>> ${scala.version}
>> 
>> 
>> org.apache.spark
>> spark-core_2.10
>> 1.5.0-cdh5.5.1
>> 
>> 
>> org.apache.spark
>> spark-mllib_2.10
>> 1.5.0-cdh5.5.1
>> 
>> 
>> org.apache.spark
>> spark-hive_2.10
>> 1.5.0
>> 
>>
>> 
>> 
>> src/main/scala
>> src/test/scala
>> 
>> 
>> org.scala-tools
>> maven-scala-plugin
>> ${maven-scala-plugin.version}
>> 
>> 
>> 
>> compile
>> testCompile
>> 
>> 
>> 
>> 
>> 
>> -Xms64m
>> -Xmx1024m
>> 
>> 
>> 
>> 
>> org.apache.maven.plugins
>> maven-shade-plugin

Re: Building Spark packages with SBT or Maven

2016-03-15 Thread Chandeep Singh
Yes, sbt uses the same structure as maven for source files.

> On Mar 15, 2016, at 1:53 PM, Mich Talebzadeh  
> wrote:
> 
> Thanks the maven structure is identical to sbt. just sbt file I will have to 
> replace with pom.xml
> 
> I will use your pom.xml to start with it.
> 
> Cheers
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 15 March 2016 at 13:12, Chandeep Singh  > wrote:
> You can build using maven from the command line as well.
> 
> This layout should give you an idea and here are some resources - 
> http://www.scala-lang.org/old/node/345 
> 
> 
> project/
>pom.xml   -  Defines the project
>src/
>   main/
>   java/ - Contains all java code that will go in your final artifact. 
>  
>   See maven-compiler-plugin 
>  for details
>   scala/ - Contains all scala code that will go in your final 
> artifact.  
>See maven-scala-plugin 
>  for details
>   resources/ - Contains all static files that should be available on 
> the classpath 
>in the final artifact.  See maven-resources-plugin 
>  for details
>   webapp/ - Contains all content for a web application (jsps, css, 
> images, etc.)  
> See maven-war-plugin 
>  for details
>  site/ - Contains all apt or xdoc files used to create a project website. 
>  
>  See maven-site-plugin 
>  for details   
>  test/
>  java/ - Contains all java code used for testing.   
>  See maven-compiler-plugin 
>  for details
>  scala/ - Contains all scala code used for testing.   
>   See maven-scala-plugin 
>  for details
>  resources/ - Contains all static content that should be available on 
> the 
>   classpath during testing.   See maven-resources-plugin 
>  for details
> 
> 
>> On Mar 15, 2016, at 12:38 PM, Chandeep Singh > > wrote:
>> 
>> Do you have the Eclipse Maven plugin setup? http://www.eclipse.org/m2e/ 
>> 
>> 
>> Once you have it setup, File -> New -> Other -> MavenProject -> Next / 
>> Finish. You’ll see a default POM.xml which you can modify / replace. 
>> 
>> 
>> 
>> Here is some documentation that should help: 
>> http://scala-ide.org/docs/tutorials/m2eclipse/ 
>> 
>> 
>> I’m using the same Eclipse build as you on my Mac. I mostly build a shaded 
>> JAR and SCP it to the cluster.
>> 
>>> On Mar 15, 2016, at 12:22 PM, Mich Talebzadeh >> > wrote:
>>> 
>>> Great Chandeep. I also have Eclipse Scala IDE below
>>> 
>>> scala IDE build of Eclipse SDK
>>> Build id: 4.3.0-vfinal-2015-12-01T15:55:22Z-Typesafe
>>> 
>>> I am no expert on Eclipse so if I create project called ImportCSV where do 
>>> I need to put the pom file or how do I reference it please. My Eclipse runs 
>>> on a Linux host so it cab access all the directories that sbt project 
>>> accesses? I also believe there will not be any need for external jar files 
>>> in builkd path?
>>> 
>>> Thanks
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> 
>>> On 15 March 2016 at 12:15, Chandeep Singh >> > wrote:
>>> Btw, just to add to the confusion ;) I use Maven as well since I moved from 
>>> Java to Scala but everyone I talk to has been recommending SBT for Scala. 
>>> 
>>> I use the Eclipse Scala IDE to build. http://scala-ide.org/ 
>>> 
>>> 
>>> Here is my sample PoM. You can add dependancies based on your requirement.
>>> 
>>> http://maven.apache.org/POM/4.0.0 
>>> " 
>>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance 
>>> "
>>> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
>>> 

Spark work distribution among execs

2016-03-15 Thread bkapukaranov
Hi,

I'm running a Spark 1.6.0 on YARN on a Hadoop 2.6.0 cluster. 
I observe a very strange issue. 
I run a simple job that reads about 1TB of json logs from a remote HDFS
cluster and converts them to parquet, then saves them to the local HDFS of
the Hadoop cluster.

I run it with 25 executors with sufficient resources. However the strange
thing is that the job only uses 2 executors to do most of the read work.

For example when I go to the Executors' tab in the Spark UI and look at the
"Input" column, the difference between the nodes is huge, sometimes 20G vs
120G.

1. What is the cause for this behaviour?
2. Any ideas how to achieve a more balanced performance?

Thanks,
Borislav



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tp26502.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Can we use spark inside a web service?

2016-03-15 Thread Andrés Ivaldi
Thanks Evan for the points. I had assumed what you said, but as I don't
have enough experience, maybe I was missing something. Thanks for the
answer!
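
For reference, a minimal sketch of the pattern discussed below: one long-running driver
holding a shared SparkContext, with an HTTP layer (Spark Job Server, Spray/akka-http, a
servlet) calling into it per request. The object name and cached dataset are assumptions
for illustration only, not anything confirmed on this thread.

    import org.apache.spark.{SparkConf, SparkContext}

    object SharedSparkService {
      // one SparkContext for the lifetime of the service (assumed names and paths)
      private val sc = new SparkContext(new SparkConf().setAppName("shared-context-service"))
      private val events = sc.textFile("hdfs:///data/events").cache() // placeholder dataset
      events.count() // materialize the cache once at startup

      // each web request calls this against the already-cached RDD
      def countFor(keyword: String): Long = events.filter(_.contains(keyword)).count()
    }

The HTTP handler itself would simply call SharedSparkService.countFor(param) per request;
the DAGScheduler bottleneck mentioned below still applies under high concurrency.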

On Mon, Mar 14, 2016 at 7:22 PM, Evan Chan  wrote:

> Andres,
>
> A couple points:
>
> 1) If you look at my post, you can see that you could use Spark for
> low-latency - many sub-second queries could be executed in under a
> second, with the right technology.  It really depends on "real time"
> definition, but I believe low latency is definitely possible.
> 2) Akka-http over SparkContext - this is essentially what Spark Job
> Server does.  (it uses Spray, whic is the predecessor to akka-http
> we will upgrade once Spark 2.0 is incorporated)
> 3) Someone else can probably talk about Ignite, but it is based on a
> distributed object cache. So you define your objects in Java, POJOs,
> annotate which ones you want indexed, upload your jars, then you can
> execute queries.   It's a different use case than typical OLAP.
> There is some Spark integration, but then you would have the same
> bottlenecks going through Spark.
>
>
> On Fri, Mar 11, 2016 at 5:02 AM, Andrés Ivaldi  wrote:
> > nice discussion , I've a question about  Web Service with Spark.
> >
> > What Could be the problem using Akka-http as web service (Like play does
> ) ,
> > with one SparkContext created , and the queries over -http akka using
> only
> > the instance of  that SparkContext ,
> >
> > Also about Analytics , we are working on real- time Analytics and as
> Hemant
> > said Spark is not a solution for low latency queries. What about using
> > Ingite for that?
> >
> >
> > On Fri, Mar 11, 2016 at 6:52 AM, Hemant Bhanawat 
> > wrote:
> >>
> >> Spark-jobserver is an elegant product that builds concurrency on top of
> >> Spark. But, the current design of DAGScheduler prevents Spark to become
> a
> >> truly concurrent solution for low latency queries. DagScheduler will
> turn
> >> out to be a bottleneck for low latency queries. Sparrow project was an
> >> effort to make Spark more suitable for such scenarios but it never made
> it
> >> to the Spark codebase. If Spark has to become a highly concurrent
> solution,
> >> scheduling has to be distributed.
> >>
> >> Hemant Bhanawat
> >> www.snappydata.io
> >>
> >> On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly  wrote:
> >>>
> >>> great discussion, indeed.
> >>>
> >>> Mark Hamstra and i spoke offline just now.
> >>>
> >>> Below is a quick recap of our discussion on how they've achieved
> >>> acceptable performance from Spark on the user request/response path
> (@mark-
> >>> feel free to correct/comment).
> >>>
> >>> 1) there is a big difference in request/response latency between
> >>> submitting a full Spark Application (heavy weight) versus having a
> >>> long-running Spark Application (like Spark Job Server) that submits
> >>> lighter-weight Jobs using a shared SparkContext.  mark is obviously
> using
> >>> the latter - a long-running Spark App.
> >>>
> >>> 2) there are some enhancements to Spark that are required to achieve
> >>> acceptable user request/response times.  some links that Mark provided
> are
> >>> as follows:
> >>>
> >>> https://issues.apache.org/jira/browse/SPARK-11838
> >>> https://github.com/apache/spark/pull/11036
> >>> https://github.com/apache/spark/pull/11403
> >>> https://issues.apache.org/jira/browse/SPARK-13523
> >>> https://issues.apache.org/jira/browse/SPARK-13756
> >>>
> >>> Essentially, a deeper level of caching at the shuffle file layer to
> >>> reduce compute and memory between queries.
> >>>
> >>> Note that Mark is running a slightly-modified version of stock Spark.
> >>> (He's mentioned this in prior posts, as well.)
> >>>
> >>> And I have to say that I'm, personally, seeing more and more
> >>> slightly-modified versions of Spark being deployed to production to
> >>> workaround outstanding PR's and Jiras.
> >>>
> >>> this may not be what people want to hear, but it's a trend that i'm
> >>> seeing lately as more and more customize Spark to their specific use
> cases.
> >>>
> >>> Anyway, thanks for the good discussion, everyone!  This is why we have
> >>> these lists, right!  :)
> >>>
> >>>
> >>> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan 
> >>> wrote:
> 
>  One of the premises here is that if you can restrict your workload to
>  fewer cores - which is easier with FiloDB and careful data modeling -
>  you can make this work for much higher concurrency and lower latency
>  than most typical Spark use cases.
> 
>  The reason why it typically does not work in production is that most
>  people are using HDFS and files.  These data sources are designed for
>  running queries and workloads on all your cores across many workers,
>  and not for filtering your workload down to only one or two cores.
> 
>  There is actually nothing inherent in Spark that prevents people 

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Chandeep Singh
You can build using maven from the command line as well.

This layout should give you an idea and here are some resources - 
http://www.scala-lang.org/old/node/345 

project/
   pom.xml   -  Defines the project
   src/
  main/
  java/ - Contains all java code that will go in your final artifact.  
  See maven-compiler-plugin 
 for details
  scala/ - Contains all scala code that will go in your final artifact. 
 
   See maven-scala-plugin 
 for details
  resources/ - Contains all static files that should be available on 
the classpath 
   in the final artifact.  See maven-resources-plugin 
 for details
  webapp/ - Contains all content for a web application (jsps, css, 
images, etc.)  
See maven-war-plugin 
 for details
 site/ - Contains all apt or xdoc files used to create a project website.  
 See maven-site-plugin 
 for details   
 test/
 java/ - Contains all java code used for testing.   
 See maven-compiler-plugin 
 for details
 scala/ - Contains all scala code used for testing.   
  See maven-scala-plugin 
 for details
 resources/ - Contains all static content that should be available on 
the 
  classpath during testing.   See maven-resources-plugin 
 for details
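
A hedged sketch of how this layout maps onto the sbt project discussed earlier in the
thread (the jar name is an assumption; the exact name depends on the artifactId, version
and plugins declared in the POM):

    ImportCSV/
      pom.xml                        <- sits at the project root, where ImportCSV.sbt was
      src/main/scala/ImportCSV.scala
      src/test/scala/

    cd ImportCSV
    mvn package                      # analogous to `sbt package`
    # the jar typically lands in target/<artifactId>-<version>.jar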


> On Mar 15, 2016, at 12:38 PM, Chandeep Singh  wrote:
> 
> Do you have the Eclipse Maven plugin setup? http://www.eclipse.org/m2e/ 
> 
> 
> Once you have it setup, File -> New -> Other -> MavenProject -> Next / 
> Finish. You’ll see a default POM.xml which you can modify / replace. 
> 
> 
> 
> Here is some documentation that should help: 
> http://scala-ide.org/docs/tutorials/m2eclipse/ 
> 
> 
> I’m using the same Eclipse build as you on my Mac. I mostly build a shaded 
> JAR and SCP it to the cluster.
> 
>> On Mar 15, 2016, at 12:22 PM, Mich Talebzadeh > > wrote:
>> 
>> Great Chandeep. I also have Eclipse Scala IDE below
>> 
>> scala IDE build of Eclipse SDK
>> Build id: 4.3.0-vfinal-2015-12-01T15:55:22Z-Typesafe
>> 
>> I am no expert on Eclipse so if I create project called ImportCSV where do I 
>> need to put the pom file or how do I reference it please. My Eclipse runs on 
>> a Linux host so it cab access all the directories that sbt project accesses? 
>> I also believe there will not be any need for external jar files in builkd 
>> path?
>> 
>> Thanks
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 15 March 2016 at 12:15, Chandeep Singh > > wrote:
>> Btw, just to add to the confusion ;) I use Maven as well since I moved from 
>> Java to Scala but everyone I talk to has been recommending SBT for Scala. 
>> 
>> I use the Eclipse Scala IDE to build. http://scala-ide.org/ 
>> 
>> 
>> Here is my sample PoM. You can add dependancies based on your requirement.
>> 
>> http://maven.apache.org/POM/4.0.0 
>> " 
>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance 
>> "
>>  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
>>  http://maven.apache.org/maven-v4_0_0.xsd 
>> ">
>>  4.0.0
>>  spark
>>  1.0
>>  ${project.artifactId}
>> 
>>  
>>  1.7
>>  1.7
>>  UTF-8
>>  2.10.4
>>  2.15.2
>>  
>> 
>>  
>>  
>>  cloudera-repo-releases
>>  
>> https://repository.cloudera.com/artifactory/repo/ 
>> 
>>  
>>  
>> 
>>  
>>  
>>  org.scala-lang
>>  scala-library
>>  ${scala.version}
>>  
>>  
>>  org.apache.spark
>>  spark-core_2.10
>>  1.5.0-cdh5.5.1
>>  

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Sounds like the layout is basically the same as the sbt layout, with the sbt file
replaced by pom.xml?



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 March 2016 at 13:06, Mich Talebzadeh 
wrote:

> Thanks again
>
> Is there anyway one can set this one up without eclipse much like what I
> did with sbt?
>
> I need to know the directory structure foe MVN project.
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 March 2016 at 12:38, Chandeep Singh  wrote:
>
>> Do you have the Eclipse Maven plugin setup? http://www.eclipse.org/m2e/
>>
>> Once you have it setup, File -> New -> Other -> MavenProject -> Next /
>> Finish. You’ll see a default POM.xml which you can modify / replace.
>>
>>
>> Here is some documentation that should help:
>> http://scala-ide.org/docs/tutorials/m2eclipse/
>>
>> I’m using the same Eclipse build as you on my Mac. I mostly build a
>> shaded JAR and SCP it to the cluster.
>>
>> On Mar 15, 2016, at 12:22 PM, Mich Talebzadeh 
>> wrote:
>>
>> Great Chandeep. I also have Eclipse Scala IDE below
>>
>> scala IDE build of Eclipse SDK
>> Build id: 4.3.0-vfinal-2015-12-01T15:55:22Z-Typesafe
>>
>> I am no expert on Eclipse so if I create project called ImportCSV where
>> do I need to put the pom file or how do I reference it please. My Eclipse
>> runs on a Linux host so it cab access all the directories that sbt project
>> accesses? I also believe there will not be any need for external jar files
>> in builkd path?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 15 March 2016 at 12:15, Chandeep Singh  wrote:
>>
>>> Btw, just to add to the confusion ;) I use Maven as well since I moved
>>> from Java to Scala but everyone I talk to has been recommending SBT for
>>> Scala.
>>>
>>> I use the Eclipse Scala IDE to build. http://scala-ide.org/
>>>
>>> Here is my sample PoM. You can add dependancies based on your
>>> requirement.
>>>
>>> http://maven.apache.org/POM/4.0.0; xmlns:xsi="
>>> http://www.w3.org/2001/XMLSchema-instance;
>>> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>>> http://maven.apache.org/maven-v4_0_0.xsd;>
>>> 4.0.0
>>> spark
>>> 1.0
>>> ${project.artifactId}
>>>
>>> 
>>> 1.7
>>> 1.7
>>> UTF-8
>>> 2.10.4
>>> 2.15.2
>>> 
>>>
>>> 
>>> 
>>> cloudera-repo-releases
>>> https://repository.cloudera.com/artifactory/repo/
>>> 
>>> 
>>>
>>> 
>>> 
>>> org.scala-lang
>>> scala-library
>>> ${scala.version}
>>> 
>>> 
>>> org.apache.spark
>>> spark-core_2.10
>>> 1.5.0-cdh5.5.1
>>> 
>>> 
>>> org.apache.spark
>>> spark-mllib_2.10
>>> 1.5.0-cdh5.5.1
>>> 
>>> 
>>> org.apache.spark
>>> spark-hive_2.10
>>> 1.5.0
>>> 
>>>
>>> 
>>> 
>>> src/main/scala
>>> src/test/scala
>>> 
>>> 
>>> org.scala-tools
>>> maven-scala-plugin
>>> ${maven-scala-plugin.version}
>>> 
>>> 
>>> 
>>> compile
>>> testCompile
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -Xms64m
>>> -Xmx1024m
>>> 
>>> 
>>> 
>>> 
>>> org.apache.maven.plugins
>>> maven-shade-plugin
>>> 1.6
>>> 
>>> 
>>> package
>>> 
>>> shade
>>> 
>>> 
>>> 
>>> 
>>> *:*
>>> 
>>> META-INF/*.SF
>>> META-INF/*.DSA
>>> META-INF/*.RSA
>>> 
>>> 
>>> 
>>> 
>>> >>
>>> implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
>>> com.group.id.Launcher1
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>
>>> scala
>>> 
>>>
>>>
>>> On Mar 15, 2016, at 12:09 PM, Mich Talebzadeh 
>>> wrote:
>>>
>>> Ok.
>>>
>>> Sounds like opinion is divided :)
>>>
>>> I will try to build a scala app with Maven.
>>>
>>> When I build with SBT I follow this directory structure
>>>
>>> High level directory the package name like
>>>
>>> ImportCSV
>>>
>>> under ImportCSV I have a directory src and the sbt file ImportCSV.sbt
>>>
>>> in directory src I have main and scala subdirectories. My scala file is
>>> in
>>>
>>> ImportCSV/src/main/scala
>>>
>>> called ImportCSV.scala
>>>
>>> I then have a shell script that runs everything under ImportCSV directory
>>>
>>> cat generic.ksh
>>> #!/bin/ksh
>>>
>>> #
>>> #
>>> # Procedure:generic.ksh
>>> #
>>> # Description:  Compiles and run scala app usinbg sbt and spark-submit
>>> #
>>> # Parameters:   none
>>> #
>>>
>>> #
>>> 

Re: Installing Spark on Mac

2016-03-15 Thread Aida Tefera
Hi Jakob, sorry for my late reply

I tried to run the below; came back with "netstat: lunt: unknown or 
uninstrumented protocol

I also tried uninstalling version 1.6.0 and installing version 1.5.2 with Java 7 
and Scala version 2.10.6; I got the same error messages

Do you think it would be worth me trying to change the IP address in 
SPARK_MASTER_IP to the IP address of the master node? If so, how would I go 
about doing that? 

Thanks, 

Aida

Sent from my iPhone
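
For reference, macOS's netstat does not support the Linux-style -plunt flags, which is
what produces the "unknown or uninstrumented protocol" message. A hedged equivalent for
listing listening TCP ports on a Mac:

    sudo lsof -iTCP -sTCP:LISTEN -n -P
    # or, with the BSD netstat syntax:
    netstat -an -p tcp | grep LISTEN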

> On 11 Mar 2016, at 08:37, Jakob Odersky  wrote:
> 
> regarding my previous message, I forgot to mention to run netstat as
> root (sudo netstat -plunt)
> sorry for the noise
> 
>> On Fri, Mar 11, 2016 at 12:29 AM, Jakob Odersky  wrote:
>> Some more diagnostics/suggestions:
>> 
>> 1) are other services listening to ports in the 4000 range (run
>> "netstat -plunt")? Maybe there is an issue with the error message
>> itself.
>> 
>> 2) are you sure the correct java version is used? java -version
>> 
>> 3) can you revert all installation attempts you have done so far,
>> including files installed by brew/macports or maven and try again?
>> 
>> 4) are there any special firewall rules in place, forbidding
>> connections on localhost?
>> 
>> This is very weird behavior you're seeing. Spark is supposed to work
>> out-of-the-box with ZERO configuration necessary for running a local
>> shell. Again, my prime suspect is a previous, failed Spark
>> installation messing up your config.
>> 
>>> On Thu, Mar 10, 2016 at 12:24 PM, Tristan Nixon  
>>> wrote:
>>> If you type ‘whoami’ in the terminal, and it responds with ‘root’ then 
>>> you’re the superuser.
>>> However, as mentioned below, I don’t think its a relevant factor.
>>> 
 On Mar 10, 2016, at 12:02 PM, Aida Tefera  wrote:
 
 Hi Tristan,
 
 I'm afraid I wouldn't know whether I'm running it as super user.
 
 I have java version 1.8.0_73 and SCALA version 2.11.7
 
 Sent from my iPhone
 
> On 9 Mar 2016, at 21:58, Tristan Nixon  wrote:
> 
> That’s very strange. I just un-set my SPARK_HOME env param, downloaded a 
> fresh 1.6.0 tarball,
> unzipped it to local dir (~/Downloads), and it ran just fine - the driver 
> port is some randomly generated large number.
> So SPARK_HOME is definitely not needed to run this.
> 
> Aida, you are not running this as the super-user, are you?  What versions 
> of Java & Scala do you have installed?
> 
>> On Mar 9, 2016, at 3:53 PM, Aida Tefera  wrote:
>> 
>> Hi Jakob,
>> 
>> Tried running the command env|grep SPARK; nothing comes back
>> 
>> Tried env|grep Spark; which is the directory I created for Spark once I 
>> downloaded the tgz file; comes back with PWD=/Users/aidatefera/Spark
>> 
>> Tried running ./bin/spark-shell ; comes back with same error as below; 
>> i.e could not bind to port 0 etc.
>> 
>> Sent from my iPhone
>> 
>>> On 9 Mar 2016, at 21:42, Jakob Odersky  wrote:
>>> 
>>> As Tristan mentioned, it looks as though Spark is trying to bind on
>>> port 0 and then 1 (which is not allowed). Could it be that some
>>> environment variables from you previous installation attempts are
>>> polluting your configuration?
>>> What does running "env | grep SPARK" show you?
>>> 
>>> Also, try running just "/bin/spark-shell" (without the --master
>>> argument), maybe your shell is doing some funky stuff with the
>>> brackets.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
>> 
>>> On Thu, Mar 10, 2016 at 12:24 PM, Tristan Nixon  
>>> wrote:
>>> If you type ‘whoami’ in the terminal, and it responds with ‘root’ then 
>>> you’re the superuser.
>>> However, as mentioned below, I don’t think its a relevant factor.
>>> 
 On Mar 10, 2016, at 12:02 PM, Aida Tefera  wrote:
 
 Hi Tristan,
 
 I'm afraid I wouldn't know whether I'm running it as super user.
 
 I have java version 1.8.0_73 and SCALA version 2.11.7
 
 Sent from my iPhone
 
> On 9 Mar 2016, at 21:58, Tristan Nixon  wrote:
> 
> That’s very strange. I just un-set my SPARK_HOME env param, downloaded a 
> fresh 1.6.0 tarball,
> unzipped it to local dir (~/Downloads), and it ran just fine - the driver 
> port is some randomly generated large number.
> So SPARK_HOME is definitely not needed to run 

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Thanks again

Is there any way one can set this up without Eclipse, much like what I
did with sbt?

I need to know the directory structure for an MVN project.

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 March 2016 at 12:38, Chandeep Singh  wrote:

> Do you have the Eclipse Maven plugin setup? http://www.eclipse.org/m2e/
>
> Once you have it setup, File -> New -> Other -> MavenProject -> Next /
> Finish. You’ll see a default POM.xml which you can modify / replace.
>
>
> Here is some documentation that should help:
> http://scala-ide.org/docs/tutorials/m2eclipse/
>
> I’m using the same Eclipse build as you on my Mac. I mostly build a shaded
> JAR and SCP it to the cluster.
>
> On Mar 15, 2016, at 12:22 PM, Mich Talebzadeh 
> wrote:
>
> Great Chandeep. I also have Eclipse Scala IDE below
>
> scala IDE build of Eclipse SDK
> Build id: 4.3.0-vfinal-2015-12-01T15:55:22Z-Typesafe
>
> I am no expert on Eclipse so if I create project called ImportCSV where do
> I need to put the pom file or how do I reference it please. My Eclipse runs
> on a Linux host so it cab access all the directories that sbt project
> accesses? I also believe there will not be any need for external jar files
> in builkd path?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 March 2016 at 12:15, Chandeep Singh  wrote:
>
>> Btw, just to add to the confusion ;) I use Maven as well since I moved
>> from Java to Scala but everyone I talk to has been recommending SBT for
>> Scala.
>>
>> I use the Eclipse Scala IDE to build. http://scala-ide.org/
>>
>> Here is my sample PoM. You can add dependancies based on your requirement.
>>
>> http://maven.apache.org/POM/4.0.0; xmlns:xsi="
>> http://www.w3.org/2001/XMLSchema-instance;
>> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>> http://maven.apache.org/maven-v4_0_0.xsd;>
>> 4.0.0
>> spark
>> 1.0
>> ${project.artifactId}
>>
>> 
>> 1.7
>> 1.7
>> UTF-8
>> 2.10.4
>> 2.15.2
>> 
>>
>> 
>> 
>> cloudera-repo-releases
>> https://repository.cloudera.com/artifactory/repo/
>> 
>> 
>>
>> 
>> 
>> org.scala-lang
>> scala-library
>> ${scala.version}
>> 
>> 
>> org.apache.spark
>> spark-core_2.10
>> 1.5.0-cdh5.5.1
>> 
>> 
>> org.apache.spark
>> spark-mllib_2.10
>> 1.5.0-cdh5.5.1
>> 
>> 
>> org.apache.spark
>> spark-hive_2.10
>> 1.5.0
>> 
>>
>> 
>> 
>> src/main/scala
>> src/test/scala
>> 
>> 
>> org.scala-tools
>> maven-scala-plugin
>> ${maven-scala-plugin.version}
>> 
>> 
>> 
>> compile
>> testCompile
>> 
>> 
>> 
>> 
>> 
>> -Xms64m
>> -Xmx1024m
>> 
>> 
>> 
>> 
>> org.apache.maven.plugins
>> maven-shade-plugin
>> 1.6
>> 
>> 
>> package
>> 
>> shade
>> 
>> 
>> 
>> 
>> *:*
>> 
>> META-INF/*.SF
>> META-INF/*.DSA
>> META-INF/*.RSA
>> 
>> 
>> 
>> 
>> >
>> implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
>> com.group.id.Launcher1
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>
>> scala
>> 
>>
>>
>> On Mar 15, 2016, at 12:09 PM, Mich Talebzadeh 
>> wrote:
>>
>> Ok.
>>
>> Sounds like opinion is divided :)
>>
>> I will try to build a scala app with Maven.
>>
>> When I build with SBT I follow this directory structure
>>
>> High level directory the package name like
>>
>> ImportCSV
>>
>> under ImportCSV I have a directory src and the sbt file ImportCSV.sbt
>>
>> in directory src I have main and scala subdirectories. My scala file is in
>>
>> ImportCSV/src/main/scala
>>
>> called ImportCSV.scala
>>
>> I then have a shell script that runs everything under ImportCSV directory
>>
>> cat generic.ksh
>> #!/bin/ksh
>>
>> #
>> #
>> # Procedure:generic.ksh
>> #
>> # Description:  Compiles and run scala app usinbg sbt and spark-submit
>> #
>> # Parameters:   none
>> #
>>
>> #
>> # Vers|  Date  | Who | DA | Description
>>
>> #-++-++-
>> # 1.0 |04/03/15|  MT || Initial Version
>>
>> #
>> #
>> function F_USAGE
>> {
>>echo "USAGE: ${1##*/} -A ''"
>>echo "USAGE: ${1##*/} -H '' -h ''"
>>exit 10
>> }
>> #
>> # Main Section
>> #
>> if [[ "${1}" = "-h" || "${1}" = "-H" ]]; then
>>F_USAGE $0
>> fi
>> ## MAP INPUT TO VARIABLES
>> while getopts A: opt
>> do
>>case $opt in
>>(A) APPLICATION="$OPTARG" ;;
>>(*) F_USAGE $0 ;;
>>esac
>> done
>> [[ -z 

mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread manasdebashiskar
Hi, 
 I have a streaming application that takes data from a kafka topic and uses
mapwithstate.
 After a couple of hours of smooth running of the application I see a problem
that seems to have stalled it.
The batch seems to have been stuck after the following error popped up.
Has anyone seen this error or know what causes it? 
14/03/2016 21:41:13,295 ERROR org.apache.spark.ContextCleaner: 95 - Error
cleaning broadcast 7456
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
seconds]. This timeout is controlled by spark.rpc.askTimeout
at
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at
org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
at
org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
at
org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at
org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
at
org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:233)
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:189)
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:180)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:180)
at
org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
at
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:173)
at
org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after
[120 seconds]
at
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 12 more
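
For reference, the timeout named in the trace is configurable, and the context cleaner
can be made non-blocking. A hedged sketch only, not a confirmed fix for the stall:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.rpc.askTimeout", "300s")                       // the timeout from the stack trace
      .set("spark.cleaner.referenceTracking.blocking", "false")  // don't block cleanup on broadcast removal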




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/mapwithstate-Hangs-with-Error-cleaning-broadcast-tp26500.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Imporvement the cube with the Fast Cubing In apache Kylin

2016-03-15 Thread licl
HI,

I tried to build a cube on a 100 million row data set.
When I set 9 fields to build the cube with 10 cores,
it nearly cost me a whole day to finish the job.
At the same time, it generated almost 1 TB of data in the "/tmp" folder.
Could we refer to the "fast cube" algorithm in Apache Kylin
to make the cube build faster?

Even running the group by first and then generating the cube is quicker.
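
A minimal sketch of that "group by first, then cube" idea using Spark SQL's cube
operator (available since Spark 1.4). The column names, measure and paths are
assumptions for illustration only; the real job used 9 fields:

    import org.apache.spark.sql.functions.sum

    val df = sqlContext.table("fact")                                        // assumed source table
    val pre = df.groupBy("d1", "d2", "d3").agg(sum("metric").as("metric"))   // shrink the input first
    val cubed = pre.cube("d1", "d2", "d3").agg(sum("metric").as("metric"))   // then cube the smaller set
    cubed.write.parquet("/tmp/cube_out")                                     // assumed output path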



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Imporvement-the-cube-with-the-Fast-Cubing-In-apache-Kylin-tp26499.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Imporvement the cube with the Fast Cubing In apache Kylin

2016-03-15 Thread 李承霖
HI,


I tried to build a cube on a 100 million row data set.
When I set 9 fields to build the cube with 10 cores,
it nearly cost me a whole day to finish the job.
At the same time, it generated almost 1 TB of data in the "/tmp" folder.
Could we refer to the "fast cube" algorithm in Apache Kylin
to make the cube build faster?

Even running the group by first and then generating the cube is quicker.

Re: Compress individual RDD

2016-03-15 Thread Nirav Patel
Thanks Sabarish, I thought of the same; will try that.

Hi Ted, good question. I guess one way is to have an api like
`rdd.persist(storageLevel,
compress)` where 'compress' can be true or false.

On Tue, Mar 15, 2016 at 5:18 PM, Sabarish Sasidharan  wrote:

> It will compress only rdds with serialization enabled in the persistence
> mode. So you could skip _SER modes for your other rdds. Not perfect but
> something.
> On 15-Mar-2016 4:33 pm, "Nirav Patel"  wrote:
>
>> Hi,
>>
>> I see that there's following spark config to compress an RDD.  My guess
>> is it will compress all RDDs of a given SparkContext, right?  If so, is
>> there a way to instruct spark context to only compress some rdd and leave
>> others uncompressed ?
>>
>> Thanks
>>
>> spark.rdd.compress false Whether to compress serialized RDD partitions
>> (e.g. forStorageLevel.MEMORY_ONLY_SER). Can save substantial space at
>> the cost of some extra CPU time.
>>
>>
>>
>> 
>
>

-- 





Spark work distribution among execs

2016-03-15 Thread Borislav Kapukaranov
Hi,

I'm running Spark 1.6.0 on YARN on a Hadoop 2.6.0 cluster.
I observe a very strange issue.
I run a simple job that reads about 1TB of json logs from a remote HDFS
cluster and converts them to parquet, then saves them to the local HDFS of
the Hadoop cluster.

I run it with 25 executors with sufficient resources. However the strange
thing is that the job only uses 2 executors to do most of the read work.

For example when I go to the Executors' tab in the Spark UI and look at the
"Input" column, the difference between the nodes is huge, sometimes 20G vs
120G.

Any ideas how to achieve a more balanced performance?

Thanks,
Borislav
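
A hedged sketch of one common remedy, assuming (and it is only an assumption) that the
imbalance comes from too few input partitions; the numbers and paths are illustrative:

    // ask for more input splits up front, or repartition before the heavy work
    val lines = sc.textFile("hdfs://remote/logs/*.json", 200)
    val df = sqlContext.read.json(lines).repartition(200)
    df.write.parquet("hdfs:///local/parquet/out")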


Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Chandeep Singh
Do you have the Eclipse Maven plugin setup? http://www.eclipse.org/m2e/ 


Once you have it setup, File -> New -> Other -> MavenProject -> Next / Finish. 
You’ll see a default POM.xml which you can modify / replace. 



Here is some documentation that should help: 
http://scala-ide.org/docs/tutorials/m2eclipse/ 


I’m using the same Eclipse build as you on my Mac. I mostly build a shaded JAR 
and SCP it to the cluster.

> On Mar 15, 2016, at 12:22 PM, Mich Talebzadeh  
> wrote:
> 
> Great Chandeep. I also have Eclipse Scala IDE below
> 
> scala IDE build of Eclipse SDK
> Build id: 4.3.0-vfinal-2015-12-01T15:55:22Z-Typesafe
> 
> I am no expert on Eclipse so if I create project called ImportCSV where do I 
> need to put the pom file or how do I reference it please. My Eclipse runs on 
> a Linux host so it cab access all the directories that sbt project accesses? 
> I also believe there will not be any need for external jar files in builkd 
> path?
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 15 March 2016 at 12:15, Chandeep Singh  > wrote:
> Btw, just to add to the confusion ;) I use Maven as well since I moved from 
> Java to Scala but everyone I talk to has been recommending SBT for Scala. 
> 
> I use the Eclipse Scala IDE to build. http://scala-ide.org/ 
> 
> 
> Here is my sample PoM. You can add dependancies based on your requirement.
> 
> http://maven.apache.org/POM/4.0.0 
> " 
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance 
> "
>   xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
>  http://maven.apache.org/maven-v4_0_0.xsd 
> ">
>   4.0.0
>   spark
>   1.0
>   ${project.artifactId}
> 
>   
>   1.7
>   1.7
>   UTF-8
>   2.10.4
>   2.15.2
>   
> 
>   
>   
>   cloudera-repo-releases
>   
> https://repository.cloudera.com/artifactory/repo/ 
> 
>   
>   
> 
>   
>   
>   org.scala-lang
>   scala-library
>   ${scala.version}
>   
>   
>   org.apache.spark
>   spark-core_2.10
>   1.5.0-cdh5.5.1
>   
>   
>   org.apache.spark
>   spark-mllib_2.10
>   1.5.0-cdh5.5.1
>   
>   
>   org.apache.spark
>   spark-hive_2.10
>   1.5.0
>   
> 
>   
>   
>   src/main/scala
>   src/test/scala
>   
>   
>   org.scala-tools
>   maven-scala-plugin
>   ${maven-scala-plugin.version}
>   
>   
>   
>   compile
>   testCompile
>   
>   
>   
>   
>   
>   -Xms64m
>   -Xmx1024m
>   
>   
>   
>   
>   org.apache.maven.plugins
>   maven-shade-plugin
>   1.6
>   
>   
>   package
>   
>   shade
>   
>   
>   
>   
>   
> *:*
>   
> 
>  

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Great Chandeep. I also have Eclipse Scala IDE below

scala IDE build of Eclipse SDK
Build id: 4.3.0-vfinal-2015-12-01T15:55:22Z-Typesafe

I am no expert on Eclipse, so if I create a project called ImportCSV, where do
I need to put the pom file, or how do I reference it please? My Eclipse runs
on a Linux host so it can access all the directories that the sbt project
accesses. I also believe there will not be any need for external jar files
in the build path?

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 March 2016 at 12:15, Chandeep Singh  wrote:

> Btw, just to add to the confusion ;) I use Maven as well since I moved
> from Java to Scala but everyone I talk to has been recommending SBT for
> Scala.
>
> I use the Eclipse Scala IDE to build. http://scala-ide.org/
>
> Here is my sample PoM. You can add dependancies based on your requirement.
>
> http://maven.apache.org/POM/4.0.0; xmlns:xsi="
> http://www.w3.org/2001/XMLSchema-instance;
> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> http://maven.apache.org/maven-v4_0_0.xsd;>
> 4.0.0
> spark
> 1.0
> ${project.artifactId}
>
> 
> 1.7
> 1.7
> UTF-8
> 2.10.4
> 2.15.2
> 
>
> 
> 
> cloudera-repo-releases
> https://repository.cloudera.com/artifactory/repo/
> 
> 
>
> 
> 
> org.scala-lang
> scala-library
> ${scala.version}
> 
> 
> org.apache.spark
> spark-core_2.10
> 1.5.0-cdh5.5.1
> 
> 
> org.apache.spark
> spark-mllib_2.10
> 1.5.0-cdh5.5.1
> 
> 
> org.apache.spark
> spark-hive_2.10
> 1.5.0
> 
>
> 
> 
> src/main/scala
> src/test/scala
> 
> 
> org.scala-tools
> maven-scala-plugin
> ${maven-scala-plugin.version}
> 
> 
> 
> compile
> testCompile
> 
> 
> 
> 
> 
> -Xms64m
> -Xmx1024m
> 
> 
> 
> 
> org.apache.maven.plugins
> maven-shade-plugin
> 1.6
> 
> 
> package
> 
> shade
> 
> 
> 
> 
> *:*
> 
> META-INF/*.SF
> META-INF/*.DSA
> META-INF/*.RSA
> 
> 
> 
> 
> 
> implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
> com.group.id.Launcher1
> 
> 
> 
> 
> 
> 
> 
> 
>
> scala
> 
>
>
> On Mar 15, 2016, at 12:09 PM, Mich Talebzadeh 
> wrote:
>
> Ok.
>
> Sounds like opinion is divided :)
>
> I will try to build a scala app with Maven.
>
> When I build with SBT I follow this directory structure
>
> High level directory the package name like
>
> ImportCSV
>
> under ImportCSV I have a directory src and the sbt file ImportCSV.sbt
>
> in directory src I have main and scala subdirectories. My scala file is in
>
> ImportCSV/src/main/scala
>
> called ImportCSV.scala
>
> I then have a shell script that runs everything under ImportCSV directory
>
> cat generic.ksh
> #!/bin/ksh
>
> #
> #
> # Procedure:generic.ksh
> #
> # Description:  Compiles and run scala app usinbg sbt and spark-submit
> #
> # Parameters:   none
> #
>
> #
> # Vers|  Date  | Who | DA | Description
>
> #-++-++-
> # 1.0 |04/03/15|  MT || Initial Version
>
> #
> #
> function F_USAGE
> {
>echo "USAGE: ${1##*/} -A ''"
>echo "USAGE: ${1##*/} -H '' -h ''"
>exit 10
> }
> #
> # Main Section
> #
> if [[ "${1}" = "-h" || "${1}" = "-H" ]]; then
>F_USAGE $0
> fi
> ## MAP INPUT TO VARIABLES
> while getopts A: opt
> do
>case $opt in
>(A) APPLICATION="$OPTARG" ;;
>(*) F_USAGE $0 ;;
>esac
> done
> [[ -z ${APPLICATION} ]] && print "You must specify an application value "
> && F_USAGE $0
> ENVFILE=/home/hduser/dba/bin/environment.ksh
> if [[ -f $ENVFILE ]]
> then
> . $ENVFILE
> . ~/spark_1.5.2_bin-hadoop2.6.kshrc
> else
> echo "Abort: $0 failed. No environment file ( $ENVFILE ) found"
> exit 1
> fi
> ##FILE_NAME=`basename $0 .ksh`
> FILE_NAME=${APPLICATION}
> CLASS=`echo ${FILE_NAME}|tr "[:upper:]" "[:lower:]"`
> NOW="`date +%Y%m%d_%H%M`"
> LOG_FILE=${LOGDIR}/${FILE_NAME}.log
> [ -f ${LOG_FILE} ] && rm -f ${LOG_FILE}
> print "\n" `date` ", Started $0" | tee -a ${LOG_FILE}
> cd ../${FILE_NAME}
> print "Compiling ${FILE_NAME}" | tee -a ${LOG_FILE}
> sbt package
> print "Submiiting the job" | tee -a ${LOG_FILE}
>
> ${SPARK_HOME}/bin/spark-submit \
> --packages com.databricks:spark-csv_2.11:1.3.0 \
> --class "${FILE_NAME}" \
> --master spark://50.140.197.217:7077 \
> --executor-memory=12G \
> --executor-cores=12 \
> --num-executors=2 \
> target/scala-2.10/${CLASS}_2.10-1.0.jar
> print `date` ", Finished $0" | tee -a ${LOG_FILE}
> exit
>
>
> So to run it for ImportCSV all I 

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Chandeep Singh
Btw, just to add to the confusion ;) I use Maven as well since I moved from 
Java to Scala but everyone I talk to has been recommending SBT for Scala. 

I use the Eclipse Scala IDE to build. http://scala-ide.org/ 


Here is my sample POM. You can add dependencies based on your requirements.

http://maven.apache.org/POM/4.0.0; 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
http://maven.apache.org/maven-v4_0_0.xsd;>
4.0.0
spark
1.0
${project.artifactId}


1.7
1.7
UTF-8
2.10.4
2.15.2




cloudera-repo-releases

https://repository.cloudera.com/artifactory/repo/





org.scala-lang
scala-library
${scala.version}


org.apache.spark
spark-core_2.10
1.5.0-cdh5.5.1


org.apache.spark
spark-mllib_2.10
1.5.0-cdh5.5.1


org.apache.spark
spark-hive_2.10
1.5.0




src/main/scala
src/test/scala


org.scala-tools
maven-scala-plugin
${maven-scala-plugin.version}



compile
testCompile





-Xms64m
-Xmx1024m




org.apache.maven.plugins
maven-shade-plugin
1.6


package

shade





*:*



META-INF/*.SF

META-INF/*.DSA

META-INF/*.RSA







com.group.id.Launcher1









scala



> On Mar 15, 2016, at 12:09 PM, Mich Talebzadeh  
> wrote:
> 
> Ok.
> 
> Sounds like opinion is divided :)
> 
> I will try to build a scala app with Maven.
> 
> When I build with SBT I follow this directory structure
> 
> High level directory the package name like
> 
> ImportCSV
> 
> under ImportCSV I have a directory src and the sbt file ImportCSV.sbt
> 
> in directory src I have main and scala subdirectories. My scala file is in
> 
> ImportCSV/src/main/scala
> 
> called ImportCSV.scala
> 
> I then have a shell script that runs everything under ImportCSV directory
> 
> cat generic.ksh
> #!/bin/ksh
> #
> #
> # Procedure:generic.ksh
> #

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Ok.

Sounds like opinion is divided :)

I will try to build a scala app with Maven.

When I build with SBT I follow this directory structure

High level directory the package name like

ImportCSV

under ImportCSV I have a directory src and the sbt file ImportCSV.sbt

in directory src I have main and scala subdirectories. My scala file is in

ImportCSV/src/main/scala

called ImportCSV.scala

I then have a shell script that runs everything under ImportCSV directory

cat generic.ksh
#!/bin/ksh
#
#
# Procedure:generic.ksh
#
# Description:  Compiles and runs the scala app using sbt and spark-submit
#
# Parameters:   none
#
#
# Vers|  Date  | Who | DA | Description
#-++-++-
# 1.0 |04/03/15|  MT || Initial Version
#
#
function F_USAGE
{
   echo "USAGE: ${1##*/} -A ''"
   echo "USAGE: ${1##*/} -H '' -h ''"
   exit 10
}
#
# Main Section
#
if [[ "${1}" = "-h" || "${1}" = "-H" ]]; then
   F_USAGE $0
fi
## MAP INPUT TO VARIABLES
while getopts A: opt
do
   case $opt in
   (A) APPLICATION="$OPTARG" ;;
   (*) F_USAGE $0 ;;
   esac
done
[[ -z ${APPLICATION} ]] && print "You must specify an application value "
&& F_USAGE $0
ENVFILE=/home/hduser/dba/bin/environment.ksh
if [[ -f $ENVFILE ]]
then
. $ENVFILE
. ~/spark_1.5.2_bin-hadoop2.6.kshrc
else
echo "Abort: $0 failed. No environment file ( $ENVFILE ) found"
exit 1
fi
##FILE_NAME=`basename $0 .ksh`
FILE_NAME=${APPLICATION}
CLASS=`echo ${FILE_NAME}|tr "[:upper:]" "[:lower:]"`
NOW="`date +%Y%m%d_%H%M`"
LOG_FILE=${LOGDIR}/${FILE_NAME}.log
[ -f ${LOG_FILE} ] && rm -f ${LOG_FILE}
print "\n" `date` ", Started $0" | tee -a ${LOG_FILE}
cd ../${FILE_NAME}
print "Compiling ${FILE_NAME}" | tee -a ${LOG_FILE}
sbt package
print "Submiiting the job" | tee -a ${LOG_FILE}

${SPARK_HOME}/bin/spark-submit \
--packages com.databricks:spark-csv_2.11:1.3.0 \
--class "${FILE_NAME}" \
--master spark://50.140.197.217:7077 \
--executor-memory=12G \
--executor-cores=12 \
--num-executors=2 \
target/scala-2.10/${CLASS}_2.10-1.0.jar
print `date` ", Finished $0" | tee -a ${LOG_FILE}
exit


So to run it for ImportCSV all I need is to do

./generic.ksh -A ImportCSV

Now can anyone kindly give me a rough guideline on directory and location
of pom.xml to make this work using maven?
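
(A hedged sketch, assuming a pom.xml placed at the top of the ImportCSV directory next
to src/, with an artifactId and version that produce target/importcsv-1.0.jar; only the
build-related lines of generic.ksh would change, roughly as follows.)

    print "Compiling ${FILE_NAME}" | tee -a ${LOG_FILE}
    mvn -q package                          # instead of: sbt package
    ...
    target/${CLASS}-1.0.jar                 # instead of: target/scala-2.10/${CLASS}_2.10-1.0.jar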

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 March 2016 at 10:50, Sean Owen  wrote:

> FWIW, I strongly prefer Maven over SBT even for Scala projects. The
> Spark build of reference is Maven.
>
> On Tue, Mar 15, 2016 at 10:45 AM, Chandeep Singh  wrote:
> > For Scala, SBT is recommended.
> >
> > On Mar 15, 2016, at 10:42 AM, Mich Talebzadeh  >
> > wrote:
> >
> > Hi,
> >
> > I build my Spark/Scala packages using SBT that works fine. I have created
> > generic shell scripts to build and submit it.
> >
> > Yesterday I noticed that some use Maven and Pom for this purpose.
> >
> > Which approach is recommended?
> >
> > Thanks,
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> >
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> >
> >
>


Re: Compress individual RDD

2016-03-15 Thread Sabarish Sasidharan
It will compress only rdds with serialization enabled in the persistence
mode. So you could skip _SER modes for your other rdds. Not perfect but
something.
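
A minimal sketch of that workaround (the RDD names and input paths are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf().setAppName("per-rdd-compress").set("spark.rdd.compress", "true")
    val sc = new SparkContext(conf)

    val bigRdd = sc.textFile("hdfs:///big/input")     // placeholder input
    val smallRdd = sc.parallelize(1 to 1000)

    bigRdd.persist(StorageLevel.MEMORY_ONLY_SER)      // serialized partitions => compressed
    smallRdd.persist(StorageLevel.MEMORY_ONLY)        // kept deserialized => left uncompressed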
On 15-Mar-2016 4:33 pm, "Nirav Patel"  wrote:

> Hi,
>
> I see that there's following spark config to compress an RDD.  My guess is
> it will compress all RDDs of a given SparkContext, right?  If so, is there
> a way to instruct spark context to only compress some rdd and leave others
> uncompressed ?
>
> Thanks
>
> spark.rdd.compress false Whether to compress serialized RDD partitions
> (e.g. forStorageLevel.MEMORY_ONLY_SER). Can save substantial space at the
> cost of some extra CPU time.
>
>
>
> 


Re: Hive Query on Spark fails with OOM

2016-03-15 Thread Sabarish Sasidharan
Yes, I suggested increasing shuffle partitions to address this problem. The
other suggestion to increase shuffle fraction was not for this but makes
sense given that you are reserving all that memory and doing nothing with
it. By diverting more of it for shuffles you can help improve your shuffle
performance.

Regards
Sab
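
A hedged sketch of those two suggestions for the Spark 1.2.1 job in question (the values
are illustrative guesses, not tuned numbers):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val conf = new SparkConf().setAppName("Hive_Join")
      .set("spark.shuffle.memoryFraction", "0.6")   // swap the fractions, since nothing is explicitly cached
      .set("spark.storage.memoryFraction", "0.2")
    val sc = new SparkContext(conf)
    val hiveCtx = new HiveContext(sc)
    hiveCtx.setConf("spark.sql.shuffle.partitions", "2000")  // well above the default of 200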
On 14-Mar-2016 2:33 pm, "Prabhu Joseph"  wrote:

> The issue is the query hits OOM on a Stage when reading Shuffle Output
> from previous stage.How come increasing shuffle memory helps to avoid OOM.
>
> On Mon, Mar 14, 2016 at 2:28 PM, Sabarish Sasidharan <
> sabarish@gmail.com> wrote:
>
>> Thats a pretty old version of Spark SQL. It is devoid of all the
>> improvements introduced in the last few releases.
>>
>> You should try bumping your spark.sql.shuffle.partitions to a value
>> higher than default (5x or 10x). Also increase your shuffle memory fraction
>> as you really are not explicitly caching anything. You could simply swap
>> the fractions in your case.
>>
>> Regards
>> Sab
>>
>> On Mon, Mar 14, 2016 at 2:20 PM, Prabhu Joseph <
>> prabhujose.ga...@gmail.com> wrote:
>>
>>> It is a Spark-SQL and the version used is Spark-1.2.1.
>>>
>>> On Mon, Mar 14, 2016 at 2:16 PM, Sabarish Sasidharan <
>>> sabarish.sasidha...@manthan.com> wrote:
>>>
 I believe the OP is using Spark SQL and not Hive on Spark.

 Regards
 Sab

 On Mon, Mar 14, 2016 at 1:55 PM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> I think the only version of Spark that works OK with Hive (Hive on
> Spark engine) is version 1.3.1. I also get OOM from time to time and have
> to revert using MR
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 March 2016 at 08:06, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> Which version of Spark are you using? The configuration varies by
>> version.
>>
>> Regards
>> Sab
>>
>> On Mon, Mar 14, 2016 at 10:53 AM, Prabhu Joseph <
>> prabhujose.ga...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> A Hive Join query which runs fine and faster in MapReduce takes lot
>>> of time with Spark and finally fails with OOM.
>>>
>>> *Query:  hivejoin.py*
>>>
>>> from pyspark import SparkContext, SparkConf
>>> from pyspark.sql import HiveContext
>>> conf = SparkConf().setAppName("Hive_Join")
>>> sc = SparkContext(conf=conf)
>>> hiveCtx = HiveContext(sc)
>>> hiveCtx.hql("INSERT OVERWRITE TABLE D select <80 columns> from A a
>>> INNER JOIN B b ON a.item_id = b.item_id LEFT JOIN C c ON c.instance_id =
>>> a.instance_id");
>>> results = hiveCtx.hql("SELECT COUNT(1) FROM D").collect()
>>> print results
>>>
>>>
>>> *Data Study:*
>>>
>>> Number of Rows:
>>>
>>> A table has 1002093508
>>> B table has5371668
>>> C table has  1000
>>>
>>> No Data Skewness:
>>>
>>> item_id in B is unique and A has multiple rows with same item_id, so
>>> after first INNER_JOIN the result set is same 1002093508 rows
>>>
>>> instance_id in C is unique and A has multiple rows with same
>>> instance_id (maximum count of number of rows with same instance_id is 
>>> 250)
>>>
>>> Spark Job runs with 90 Executors each with 2cores and 6GB memory.
>>> YARN has allotted all the requested resource immediately and no other 
>>> job
>>> is running on the
>>> cluster.
>>>
>>> spark.storage.memoryFraction 0.6
>>> spark.shuffle.memoryFraction 0.2
>>>
>>> Stage 2 - reads data from Hadoop, Tasks has NODE_LOCAL and shuffle
>>> write 500GB of intermediate data
>>>
>>> Stage 3 - does shuffle read of 500GB data, tasks has PROCESS_LOCAL
>>> and output of 400GB is shuffled
>>>
>>> Stage 4 - tasks fails with OOM on reading the shuffled output data
>>> when it reached 40GB data itself
>>>
>>> First of all, what kind of Hive queries when run on Spark gets a
>>> better performance than Mapreduce. And what are the hive queries that 
>>> won't
>>> perform
>>> well in Spark.
>>>
>>> How to calculate the optimal Heap for Executor Memory and the number
>>> of executors for given input data size. We don't specify Spark 
>>> Executors to
>>> cache any data. But how come Stage 3 tasks says PROCESS_LOCAL. Why 
>>> Stage 4
>>> is failing immediately
>>> when it has just read 40GB data, is it caching data in Memory.
>>>
>>> And in a Spark job, some stage will need lot of memory for shuffle
>>> and some need lot of memory for cache. So, when a Spark Executor 

Re: reading file from S3

2016-03-15 Thread Yasemin Kaya
Hi Safak,

I changed the Keys but there is no change.

Best,
yasemin
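
For reference, a minimal sketch of the sc.hadoopConfiguration() route, which avoids
putting the keys (and any special characters in them) into the URI itself; it assumes
the keys are exported as environment variables as described above:

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
    val lines = sc.textFile("s3n://yasemindeneme/deneme.txt")
    lines.take(5).foreach(println)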


2016-03-15 12:46 GMT+02:00 Şafak Serdar Kapçı :

> Hello Yasemin,
> Maybe your key id or access key has special chars like backslash or
> something. You need to change it.
> Best Regards,
> Safak.
>
> 2016-03-15 12:33 GMT+02:00 Yasemin Kaya :
>
>> Hi,
>>
>> I am using Spark 1.6.0 standalone and I want to read a txt file from S3
>> bucket named yasemindeneme and my file name is deneme.txt. But I am getting
>> this error. Here is the simple code
>> 
>> Exception in thread "main" java.lang.IllegalArgumentException: Invalid
>> hostname in URI s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@
>> /yasemindeneme/deneme.txt
>> at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:45)
>> at
>> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
>>
>>
>> I try 2 options
>> *sc.hadoopConfiguration() *and
>> *sc.textFile("s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@/yasemindeneme/deneme.txt/");*
>>
>> Also I did export AWS_ACCESS_KEY_ID= .
>>  export AWS_SECRET_ACCESS_KEY=
>> But there is no change about error.
>>
>> Could you please help me about this issue?
>>
>>
>> --
>> hiç ender hiç
>>
>
>


-- 
hiç ender hiç


Re: Compress individual RDD

2016-03-15 Thread Ted Yu
Looks like there is no such capability yet. 

How would you specify which rdd's to compress ?

Thanks

> On Mar 15, 2016, at 4:03 AM, Nirav Patel  wrote:
> 
> Hi,
> 
> I see that there's following spark config to compress an RDD.  My guess is it 
> will compress all RDDs of a given SparkContext, right?  If so, is there a way 
> to instruct spark context to only compress some rdd and leave others 
> uncompressed ?
> 
> Thanks
> 
> spark.rdd.compressfalse   Whether to compress serialized RDD partitions 
> (e.g. forStorageLevel.MEMORY_ONLY_SER). Can save substantial space at the 
> cost of some extra CPU time.
> 
> 
> 
> 
> 
> 


Re: Spark streaming - update configuration while retaining write ahead log data?

2016-03-15 Thread Ted Yu
I did a quick search but haven't found a JIRA in this regard.

If configuration were kept separate from the checkpoint data, more use cases could be
accommodated.
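
For context, a minimal sketch of why edits to the job configuration are ignored on
recovery: StreamingContext.getOrCreate only runs the factory function when no checkpoint
exists; otherwise everything, including the SparkConf, is rebuilt from the checkpoint.
The path and batch interval below are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-app"   // placeholder

    def createContext(): StreamingContext = {
      // only consulted on a fresh start; on recovery the conf comes from the checkpoint
      val conf = new SparkConf().setAppName("wal-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)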

> On Mar 15, 2016, at 2:21 AM, Saisai Shao  wrote:
> 
> Currently configuration is a part of checkpoint data, and when recovering 
> from failure, Spark Streaming will fetch the configuration from checkpoint 
> data, so even if you change the configuration file, recovered Spark Streaming 
> application will not use it. So from my understanding currently there's no 
> way to handle your situation.
> 
> Thanks
> Saisai
> 
>> On Tue, Mar 15, 2016 at 5:12 PM, Ewan Leith  
>> wrote:
>> Has anyone seen a way of updating the Spark streaming job configuration 
>> while retaining the existing data in the write ahead log?
>> 
>>  
>> 
>> e.g. if you’ve launched a job without enough executors and a backlog has 
>> built up in the WAL, can you increase the number of executors without losing 
>> the WAL data?
>> 
>>  
>> 
>> Thanks,
>> 
>> Ewan
>> 
> 


Compress individual RDD

2016-03-15 Thread Nirav Patel
Hi,

I see that there's the following spark config to compress an RDD.  My guess is
it will compress all RDDs of a given SparkContext, right?  If so, is there
a way to instruct the Spark context to only compress some RDDs and leave others
uncompressed?

Thanks

spark.rdd.compress (default: false): Whether to compress serialized RDD partitions
(e.g. for StorageLevel.MEMORY_ONLY_SER). Can save substantial space at the
cost of some extra CPU time.

-- 




