Re: groupByKey does not work?

2016-01-05 Thread Sean Owen
I suspect this is another instance of case classes not working as
expected between the driver and executor when used with spark-shell.
Search JIRA for some back story.
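One workaround that usually avoids it in the shell is to not use the
REPL-defined case class as the grouping key, e.g. key on a tuple built from
the same fields (untested sketch against the code quoted below):

// Tuples have well-defined equals/hashCode across driver and executors,
// unlike case classes compiled inside the spark-shell REPL.
val grouped = myrdd
  .map { case (k, v) => ((k.uname, k.lo, k.f1, k.f2, k.f3, k.f4, k.f5, k.f6), v) }
  .groupByKey()

grouped.map { case (key, vals) => (key, 1) }.collect().foreach(println)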

On Tue, Jan 5, 2016 at 12:42 AM, Arun Luthra  wrote:
> Spark 1.5.0
>
> data:
>
> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
>
> spark-shell:
>
> spark-shell \
> --num-executors 2 \
> --driver-memory 1g \
> --executor-memory 10g \
> --executor-cores 8 \
> --master yarn-client
>
>
> case class Mykey(uname:String, lo:String, f1:Char, f2:Char, f3:Char,
> f4:Char, f5:Char, f6:String)
> case class Myvalue(count1:Long, count2:Long, num:Double)
>
> val myrdd = sc.textFile("/user/al733a/mydata.txt").map { case line => {
> val spl = line.split("\\|", -1)
> val k = spl(0).split(",")
> val v = spl(1).split(",")
> (Mykey(k(0), k(1), k(2)(0).toChar, k(3)(0).toChar, k(4)(0).toChar,
> k(5)(0).toChar, k(6)(0).toChar, k(7)),
>  Myvalue(v(0).toLong, v(1).toLong, v(2).toDouble)
> )
> }}
>
> myrdd.groupByKey().map { case (mykey, val_iterable) => (mykey, 1)
> }.collect().foreach(println)
>
> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
>
>
>
> You can see that each key is repeated 2 times but each key should only
> appear once.
>
> Arun
>
> On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu  wrote:
>>
>> Can you give a bit more information ?
>>
>> Release of Spark you're using
>> Minimal dataset that shows the problem
>>
>> Cheers
>>
>> On Mon, Jan 4, 2016 at 3:55 PM, Arun Luthra  wrote:
>>>
>>> I tried groupByKey and noticed that it did not group all values into the
>>> same group.
>>>
>>> In my test dataset (a Pair rdd) I have 16 records, where there are only 4
>>> distinct keys, so I expected there to be 4 records in the groupByKey object,
>>> but instead there were 8. Each of the 4 distinct keys appear 2 times.
>>>
>>> Is this the expected behavior? I need to be able to get ALL values
>>> associated with each key grouped into a SINGLE record. Is it possible?
>>>
>>> Arun
>>>
>>> p.s. reducebykey will not be sufficient for me
>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Is there a way to use parallelize function in sparkR spark version (1.6.0)

2016-01-05 Thread Chandan Verma
 







finding distinct count using dataframe

2016-01-05 Thread Arunkumar Pillai
Hi

Are there any functions to find the distinct count of all the variables in a
dataframe?

val sc = new SparkContext(conf) // spark context
val options = Map("header" -> "true", "delimiter" -> delimiter,
"inferSchema" -> "true")
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // sql context
val datasetDF =
sqlContext.read.format("com.databricks.spark.csv").options(options).load(inputFile)


We are able to get the schema and variable data types. Is there any method
to get the distinct count?



-- 
Thanks and Regards
Arun


Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Jeff Zhang
It seems that currently spark.scheduler.pool must be set as a local property
(associated with a thread). Is there any reason why spark.scheduler.pool cannot
be used globally? My scenario is that I want my thriftserver started with the
fair scheduler and a default pool, without using a SET command to set the
pool. Is there any way to do that? Or do I miss anything here?
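For reference, this is what I do today (a sketch; "fair_pool" is just an
example pool name from my fairscheduler.xml), and it has to be repeated in
every thread/session:

import org.apache.spark.{SparkConf, SparkContext}

// Enable the fair scheduler globally ...
val conf = new SparkConf().set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

// ... but the pool itself is a thread-local property, so each thread must set it.
sc.setLocalProperty("spark.scheduler.pool", "fair_pool")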

-- 
Best Regards

Jeff Zhang


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Sean Owen
+juliet for an additional opinion, but FWIW I think it's safe to say
that future CDH will have a more consistent Python story and that
story will support 2.7 rather than 2.6.

On Tue, Jan 5, 2016 at 7:17 AM, Reynold Xin  wrote:
> Does anybody here care about us dropping support for Python 2.6 in Spark
> 2.0?
>
> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
> parsing) when compared with Python 2.7. Some libraries that Spark depend on
> stopped supporting 2.6. We can still convince the library maintainers to
> support 2.6, but it will be extra work. I'm curious if anybody still uses
> Python 2.6 to run Spark.
>
> Thanks.
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: finding distinct count using dataframe

2016-01-05 Thread Yanbo Liang
Hi Arunkumar,

You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or
approxCountDistinct for an approximate result.
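Both functions live in org.apache.spark.sql.functions; a quick sketch (col1
and col2 are placeholder column names):

import org.apache.spark.sql.functions.{countDistinct, approxCountDistinct}

// Exact distinct count over a combination of columns.
datasetDF.select(countDistinct(datasetDF("col1"), datasetDF("col2"))).show()

// Approximate distinct count (HyperLogLog-based), usually much cheaper on large data.
datasetDF.select(approxCountDistinct(datasetDF("col1"))).show()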

2016-01-05 17:11 GMT+08:00 Arunkumar Pillai :

> Hi
>
> Is there any   functions to find distinct count of all the variables in
> dataframe.
>
> val sc = new SparkContext(conf) // spark context
> val options = Map("header" -> "true", "delimiter" -> delimiter, "inferSchema" 
> -> "true")
> val sqlContext = new org.apache.spark.sql.SQLContext(sc) // sql context
> val datasetDF = 
> sqlContext.read.format("com.databricks.spark.csv").options(options).load(inputFile)
>
>
> we are able to get the schema, variable data type. is there any method to get 
> the distinct count ?
>
>
>
> --
> Thanks and Regards
> Arun
>


Networking problems in Spark 1.6.0

2016-01-05 Thread Yiannis Gkoufas
Hi there,

I have been using Spark 1.5.2 on my cluster without a problem and wanted to
try Spark 1.6.0.
I have the exact same configuration on both clusters.
I am able to start the Standalone Cluster but I fail to submit a job
getting errors like the following:

16/01/05 14:24:14 INFO AppClient$ClientEndpoint: Connecting to master
spark://my-ip:7077...
16/01/05 14:24:34 INFO AppClient$ClientEndpoint: Connecting to master
spark://my-ip:7077...
16/01/05 14:24:54 INFO AppClient$ClientEndpoint: Connecting to master
spark://my-ip:7077...
16/01/05 14:24:54 INFO AppClient$ClientEndpoint: Connecting to master
spark://my-ip:7077...
16/01/05 14:24:54 WARN TransportChannelHandler: Exception in connection
from my-ip/X.XXX.XX.XX:7077
java.lang.NoSuchMethodError:
java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
at org.apache.spark.rpc.netty.Dispatcher.postToAll(Dispatcher.scala:106)
at
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:586)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:577)
at
org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:170)
at
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:104)
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)

Has anyone else had similar problems?

Thanks a lot


RE: Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Rachana Srivastava
Thanks a lot for your prompt response.  I am pushing one message.


HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
kafkaParams.put("zookeeper.connect", "localhost:2181");
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
    jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
    kafkaParams, topicsSet);
final Accumulator<Integer> accum = jssc.sparkContext().accumulator(0);
JavaDStream<String> lines = messages.map(
    new Function<Tuple2<String, String>, String>() {
      public String call(Tuple2<String, String> tuple2) {
        LOG.info("#  Input json stream data  # " + tuple2._2);
        accum.add(1);
        return tuple2._2();
      }
    });
lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
  public Void call(JavaRDD<String> rdd) throws Exception {
    if (!rdd.isEmpty() || !rdd.partitions().isEmpty()) {
      rdd.saveAsTextFile("hdfs://quickstart.cloudera:8020/user/cloudera/testDirJan4/test1.text");
    }
    System.out.println(" & COUNT OF ACCUMULATOR IS " + accum.value());
    return null;
  }
});
jssc.start();

If I remove this saveAsTextFile I get the correct count; with this line I am
getting double counting.



Here is the stack trace with the saveAsTextFile statement. Please see the double
counting below:



&&& BEFORE COUNT OF ACCUMULATOR IS &&& 0

INFO : org.apache.spark.SparkContext - Starting job: foreachRDD at 
KafkaURLStreaming.java:90

INFO : org.apache.spark.scheduler.DAGScheduler - Got job 0 (foreachRDD at 
KafkaURLStreaming.java:90) with 1 output partitions

INFO : org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 
0(foreachRDD at KafkaURLStreaming.java:90)

INFO : org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()

INFO : org.apache.spark.scheduler.DAGScheduler - Missing parents: List()

INFO : org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 0 
(MapPartitionsRDD[1] at map at KafkaURLStreaming.java:83), which has no missing 
parents

INFO : org.apache.spark.storage.MemoryStore - ensureFreeSpace(3856) called with 
curMem=0, maxMem=1893865881

INFO : org.apache.spark.storage.MemoryStore - Block broadcast_0 stored as 
values in memory (estimated size 3.8 KB, free 1806.1 MB)

INFO : org.apache.spark.storage.MemoryStore - ensureFreeSpace(2225) called with 
curMem=3856, maxMem=1893865881

INFO : org.apache.spark.storage.MemoryStore - Block broadcast_0_piece0 stored 
as bytes in memory (estimated size 2.2 KB, free 1806.1 MB)

INFO : org.apache.spark.storage.BlockManagerInfo - Added broadcast_0_piece0 in 
memory on localhost:51637 (size: 2.2 KB, free: 1806.1 MB)

INFO : org.apache.spark.SparkContext - Created broadcast 0 from broadcast at 
DAGScheduler.scala:861

INFO : org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks 
from ResultStage 0 (MapPartitionsRDD[1] at map at KafkaURLStreaming.java:83)

INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 0.0 with 
1 tasks

INFO : org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 
0.0 (TID 0, localhost, ANY, 2026 bytes)

INFO : org.apache.spark.executor.Executor - Running task 0.0 in stage 0.0 (TID 
0)

INFO : org.apache.spark.streaming.kafka.KafkaRDD - Computing topic test11, 
partition 0 offsets 36 -> 37

INFO : kafka.utils.VerifiableProperties - Verifying properties

INFO : kafka.utils.VerifiableProperties - Property fetch.message.max.bytes is 
overridden to 1073741824

INFO : kafka.utils.VerifiableProperties - Property group.id is overridden to

INFO : kafka.utils.VerifiableProperties - Property zookeeper.connect is 
overridden to localhost:2181

INFO : com.markmonitor.antifraud.ce.KafkaURLStreaming - #  
Input json stream data  # one test message

INFO : org.apache.spark.executor.Executor - Finished task 0.0 in stage 0.0 (TID 
0). 972 bytes result sent to driver

INFO : org.apache.spark.scheduler.DAGScheduler - ResultStage 0 (foreachRDD at 
KafkaURLStreaming.java:90) finished in 0.133 s

INFO : org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 
0.0 (TID 0) in 116 ms on localhost (1/1)

INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 0.0, 
whose tasks have all completed, from pool

INFO : org.apache.spark.scheduler.DAGScheduler - Job 0 finished: foreachRDD at 
KafkaURLStreaming.java:90, took 0.496657 s

INFO : org.apache.spark.ContextCleaner - Cleaned accumulator 2

INFO : org.apache.spark.storage.BlockManagerInfo - Removed broadcast_0_piece0 
on localhost:51637 in memory (size: 2.2 KB, free: 1806.1 MB)

INFO : org.apache.hadoop.conf.Configuration.deprecation - mapred.tip.id is 
deprecated. Instead, use mapreduce.task.id

INFO : org.apache.hadoop.conf.Configuration.deprecation - mapred.task.id is 
deprecated. Instead, use mapreduce.task.attempt.id

INFO : org.apache.hadoop.conf.Configuration.deprecation - mapred.task.is.map is 

Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Dean Wampler
​Still, it would be good to know what happened exactly. Why did the netty
dependency expect Java 8?  Did you build your app on a machine with Java 8
and deploy on a Java 7 machine?​

Anyway, I played with the 1.6.0 spark-shell using Java 7 and it worked
fine. I also looked at the distribution's class files using e.g.,

$ cd $HOME/spark/spark-1.6.0-bin-hadoop2.6
​$
 jar xf lib/spark-assembly-1.6.0-hadoop2.6.0.jar
org/apache/spark/rpc/netty/Dispatcher.class
$ javap -classpath . -verbose org.apache.spark.rpc.netty.Dispatcher | grep
version
  minor version: 0
  major version: 50

So, it was compiled with Java 6 (see
https://en.wikipedia.org/wiki/Java_class_file). So, it doesn't appear to be
a Spark build issue.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Tue, Jan 5, 2016 at 9:01 AM, Yiannis Gkoufas 
wrote:

> Hi Dean,
>
> thanks so much for the response! It works without a problem now!
>
> On 5 January 2016 at 14:33, Dean Wampler  wrote:
>
>> ConcurrentHashMap.keySet() returning a KeySetView is a Java 8 method. The
>> Java 7 method returns a Set. Are you running Java 7? What happens if you
>> run Java 8?
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>>  (O'Reilly)
>> Typesafe 
>> @deanwampler 
>> http://polyglotprogramming.com
>>
>> On Tue, Jan 5, 2016 at 8:29 AM, Yiannis Gkoufas 
>> wrote:
>>
>>> Hi there,
>>>
>>> I have been using Spark 1.5.2 on my cluster without a problem and wanted
>>> to try Spark 1.6.0.
>>> I have the exact same configuration on both clusters.
>>> I am able to start the Standalone Cluster but I fail to submit a job
>>> getting errors like the following:
>>>
>>> 16/01/05 14:24:14 INFO AppClient$ClientEndpoint: Connecting to master
>>> spark://my-ip:7077...
>>> 16/01/05 14:24:34 INFO AppClient$ClientEndpoint: Connecting to master
>>> spark://my-ip:7077...
>>> 16/01/05 14:24:54 INFO AppClient$ClientEndpoint: Connecting to master
>>> spark://my-ip:7077...
>>> 16/01/05 14:24:54 INFO AppClient$ClientEndpoint: Connecting to master
>>> spark://my-ip:7077...
>>> 16/01/05 14:24:54 WARN TransportChannelHandler: Exception in connection
>>> from my-ip/X.XXX.XX.XX:7077
>>> java.lang.NoSuchMethodError:
>>> java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
>>> at org.apache.spark.rpc.netty.Dispatcher.postToAll(Dispatcher.scala:106)
>>> at
>>> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:586)
>>> at
>>> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:577)
>>> at
>>> org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:170)
>>> at
>>> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:104)
>>> at
>>> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
>>> at
>>> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>>> at
>>> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>>> at
>>> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>>> at
>>> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>>> at
>>> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>>> at
>>> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>>> at
>>> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>>> at
>>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>>> at
>>> 

Re: Negative Number of Active Tasks in Spark UI

2016-01-05 Thread Prasad Ravilla
I am using Spark 1.5.2.

I am not using Dynamic allocation.

Thanks,
Prasad.




On 1/5/16, 3:24 AM, "Ted Yu"  wrote:

>Which version of Spark do you use ?
>
>This might be related:
>https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_SPARK-2D8560=CwICAg=fa_WZs7nNMvOIDyLmzi2sMVHyyC4hN9WQl29lWJQ5Y4=-5JY3iMOXXyFuBleKruCQ-6rGWyZEyiHu8ySSzJdEHw=4v0Ji1ymhcVi2Ys2mzOne0cuiDxWMiYmeRYVUeF3hWU=9L2ltekpwnC0BDcJPW43_ctNL_G4qTXN4EY2H_Ys0nU=
> 
>
>Do you use dynamic allocation ?
>
>Cheers
>
>> On Jan 4, 2016, at 10:05 PM, Prasad Ravilla  wrote:
>> 
>> I am seeing negative active tasks in the Spark UI.
>> 
>> Is anyone seeing this?
>> How is this possible?
>> 
>> Thanks,
>> Prasad.
>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org


Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Yiannis Gkoufas
Yes, that was the case; the app was built with Java 8.
But that was also the case with Spark 1.5.2 and it didn't complain.

On 5 January 2016 at 16:40, Dean Wampler  wrote:

> ​Still, it would be good to know what happened exactly. Why did the netty
> dependency expect Java 8?  Did you build your app on a machine with Java 8
> and deploy on a Java 7 machine?​
>
> Anyway, I played with the 1.6.0 spark-shell using Java 7 and it worked
> fine. I also looked at the distribution's class files using e.g.,
>
> $ cd $HOME/spark/spark-1.6.0-bin-hadoop2.6
> ​$
>  jar xf lib/spark-assembly-1.6.0-hadoop2.6.0.jar
> org/apache/spark/rpc/netty/Dispatcher.class
> $ javap -classpath . -verbose org.apache.spark.rpc.netty.Dispatcher | grep
> version
>   minor version: 0
>   major version: 50
>
> So, it was compiled with Java 6 (see
> https://en.wikipedia.org/wiki/Java_class_file). So, it doesn't appear to
> be a Spark build issue.
>
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Tue, Jan 5, 2016 at 9:01 AM, Yiannis Gkoufas 
> wrote:
>
>> Hi Dean,
>>
>> thanks so much for the response! It works without a problem now!
>>
>> On 5 January 2016 at 14:33, Dean Wampler  wrote:
>>
>>> ConcurrentHashMap.keySet() returning a KeySetView is a Java 8 method.
>>> The Java 7 method returns a Set. Are you running Java 7? What happens if
>>> you run Java 8?
>>>
>>> Dean Wampler, Ph.D.
>>> Author: Programming Scala, 2nd Edition
>>>  (O'Reilly)
>>> Typesafe 
>>> @deanwampler 
>>> http://polyglotprogramming.com
>>>
>>> On Tue, Jan 5, 2016 at 8:29 AM, Yiannis Gkoufas 
>>> wrote:
>>>
 Hi there,

 I have been using Spark 1.5.2 on my cluster without a problem and
 wanted to try Spark 1.6.0.
 I have the exact same configuration on both clusters.
 I am able to start the Standalone Cluster but I fail to submit a job
 getting errors like the following:

 16/01/05 14:24:14 INFO AppClient$ClientEndpoint: Connecting to master
 spark://my-ip:7077...
 16/01/05 14:24:34 INFO AppClient$ClientEndpoint: Connecting to master
 spark://my-ip:7077...
 16/01/05 14:24:54 INFO AppClient$ClientEndpoint: Connecting to master
 spark://my-ip:7077...
 16/01/05 14:24:54 INFO AppClient$ClientEndpoint: Connecting to master
 spark://my-ip:7077...
 16/01/05 14:24:54 WARN TransportChannelHandler: Exception in connection
 from my-ip/X.XXX.XX.XX:7077
 java.lang.NoSuchMethodError:
 java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
 at org.apache.spark.rpc.netty.Dispatcher.postToAll(Dispatcher.scala:106)
 at
 org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:586)
 at
 org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:577)
 at
 org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:170)
 at
 org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:104)
 at
 org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
 at
 org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
 at
 io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
 at
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
 at
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
 at
 io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
 at
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
 at
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
 at
 io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
 at
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
 at
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
 at
 org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
 at
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
 at
 

Re: sparkR ORC support.

2016-01-05 Thread Prem Sure
Yes Sandeep, also copy hive-site.xml to the Spark conf directory.


On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana 
wrote:

> Also, do I need to setup hive in spark as per the link
> http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark
> ?
>
> We might need to copy hdfs-site.xml file to spark conf directory ?
>
> On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana 
> wrote:
>
>> Deepak
>>
>> Tried this. Getting this error now
>>
>> Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") :
>>   unused argument ("")
>>
>>
>> On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma 
>> wrote:
>>
>>> Hi Sandeep
>>> can you try this ?
>>>
>>> results <- sql(hivecontext, "FROM test SELECT id","")
>>>
>>> Thanks
>>> Deepak
>>>
>>>
>>> On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana 
>>> wrote:
>>>
 Thanks Deepak.

 I tried this as well. I created a hivecontext   with  "hivecontext <<-
 sparkRHive.init(sc) "  .

 When I tried to read hive table from this ,

 results <- sql(hivecontext, "FROM test SELECT id")

 I get below error,

 Error in callJMethod(sqlContext, "sql", sqlQuery) :
   Invalid jobj 2. If SparkR was restarted, Spark operations need to be 
 re-executed.


 Not sure what is causing this? Any leads or ideas? I am using rstudio.



 On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma 
 wrote:

> Hi Sandeep
> I am not sure if ORC can be read directly in R.
> But there can be a workaround .First create hive table on top of ORC
> files and then access hive table in R.
>
> Thanks
> Deepak
>
> On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana 
> wrote:
>
>> Hello
>>
>> I need to read an ORC files in hdfs in R using spark. I am not able
>> to find a package to do that.
>>
>> Can anyone help with documentation or example for this purpose?
>>
>> --
>> Architect
>> Infoworks.io
>> http://Infoworks.io
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>



 --
 Architect
 Infoworks.io
 http://Infoworks.io

>>>
>>>
>>>
>>> --
>>> Thanks
>>> Deepak
>>> www.bigdatabig.com
>>> www.keosha.net
>>>
>>
>>
>>
>> --
>> Architect
>> Infoworks.io
>> http://Infoworks.io
>>
>
>
>
> --
> Architect
> Infoworks.io
> http://Infoworks.io
>


Re: Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Jean-Baptiste Onofré

Hi Rachana,

don't you have two messages on the kafka broker ?

Regards
JB

On 01/05/2016 05:14 PM, Rachana Srivastava wrote:

I have a very simple two-line program. I am getting input from Kafka,
saving the input in a file, and counting the input received. My code
looks like this; when I run it I am getting two accumulator counts
for each input.

HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
kafkaParams.put("zookeeper.connect", "localhost:2181");

JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
    jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
    kafkaParams, topicsSet);

final Accumulator<Integer> accum = jssc.sparkContext().accumulator(0);

JavaDStream<String> lines = messages.map(
    new Function<Tuple2<String, String>, String>() {
      public String call(Tuple2<String, String> tuple2) {
        accum.add(1);
        return tuple2._2();
      }
    });

lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
  public Void call(JavaRDD<String> rdd) throws Exception {
    if (!rdd.isEmpty() || !rdd.partitions().isEmpty()) {
      rdd.saveAsTextFile("hdfs://quickstart.cloudera:8020/user/cloudera/testDirJan4/test1.text");
    }
    System.out.println(" & COUNT OF ACCUMULATOR IS " + accum.value());
    return null;
  }
});

jssc.start();

If I comment out rdd.saveAsTextFile I get the correct count, but with
rdd.saveAsTextFile I am getting multiple accumulator counts for each input.

Thanks,

Rachana



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Olivier Girardot
Hi everyone,
considering the new Datasets API, will there be Encoders defined for
reading and writing Avro files ? Will it be possible to use already
generated Avro classes ?

Regards,

-- 
*Olivier Girardot*


Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Rachana Srivastava
I have a very simple two-line program. I am getting input from Kafka, saving
the input in a file, and counting the input received. My code looks like this;
when I run it I am getting two accumulator counts for each input.

HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
kafkaParams.put("zookeeper.connect", "localhost:2181");
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
    jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
    kafkaParams, topicsSet);
final Accumulator<Integer> accum = jssc.sparkContext().accumulator(0);
JavaDStream<String> lines = messages.map(
    new Function<Tuple2<String, String>, String>() {
      public String call(Tuple2<String, String> tuple2) {
        accum.add(1);
        return tuple2._2();
      }
    });
lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
  public Void call(JavaRDD<String> rdd) throws Exception {
    if (!rdd.isEmpty() || !rdd.partitions().isEmpty()) {
      rdd.saveAsTextFile("hdfs://quickstart.cloudera:8020/user/cloudera/testDirJan4/test1.text");
    }
    System.out.println(" & COUNT OF ACCUMULATOR IS " + accum.value());
    return null;
  }
});
jssc.start();

If I comment out rdd.saveAsTextFile I get the correct count, but with
rdd.saveAsTextFile I am getting multiple accumulator counts for each input.

Thanks,

Rachana


Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Annabel Melongo
Vijay,
Are you closing the FileInputStream at the end of each loop (in.close())? My
guess is that those streams aren't closed, and thus the "too many open files"
exception.

On Tuesday, January 5, 2016 8:03 AM, Priya Ch 
 wrote:
 

Can someone throw light on this?
Regards,
Padma Ch

On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch  wrote:

Chris, we are using Spark version 1.3.0. We have not set the
spark.streaming.concurrentJobs parameter; it takes the default value.

Vijay,
From the stack trace it is evident that
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
is throwing the exception. I opened the Spark source code and visited the line
which is throwing this exception, i.e. the line marked in red. The file is
ExternalSorter.scala in the org.apache.spark.util.collection package.
I went through the following blog
http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
and understood that there is a merge factor which decides the number of on-disk
files that can be merged. Is it in some way related to this?
Regards,
Padma Ch
On Fri, Dec 25, 2015 at 7:51 PM, Chris Fregly  wrote:

And which version of Spark/Spark Streaming are you using?
Are you explicitly setting spark.streaming.concurrentJobs to something
larger than the default of 1?
If so, please try setting that back to 1 and see if the problem still exists.
This is a dangerous parameter to modify from the default, which is why it's
not well documented.

On Wed, Dec 23, 2015 at 8:23 AM, Vijay Gharge  wrote:

A few indicators:
1) During execution, check the total number of open files using the lsof
command (needs root permissions). If it is a cluster, I am not so sure.
2) Which exact line in the code is triggering this error? Can you paste that snippet?

On Wednesday 23 December 2015, Priya Ch  wrote:

ulimit -n 65000
fs.file-max = 65000 (in /etc/sysctl.conf)
Thanks,
Padma Ch
On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma  wrote:

Could you share the ulimit for your setup please? - Thanks, via mobile,
excuse brevity.
On Dec 22, 2015 6:39 PM, "Priya Ch"  wrote:

Jakob,
I increased settings like fs.file-max in /etc/sysctl.conf and also increased
the user limit in /etc/security/limits.conf, but I still see the same issue.
On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky  wrote:

It might be a good idea to see how many files are open and try increasing the 
open file limit (this is done on an os level). In some application use-cases it 
is actually a legitimate need.

If that doesn't help, make sure you close any unused files and streams in your 
code. It will also be easier to help diagnose the issue if you send an 
error-reproducing snippet.








-- 
Regards,Vijay Gharge







-- 

Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com





  
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
+1

Red Hat supports Python 2.6 on RHEL 5 until 2020, but
otherwise yes, Python 2.6 is ancient history and the core Python developers
stopped supporting it in 2013. RHEL 5 is not a good enough reason to
continue support for Python 2.6 IMO.

We should aim to support Python 2.7 and Python 3.3+ (which I believe we
currently do).

Nick

On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang  wrote:

> plus 1,
>
> we are currently using python 2.7.2 in production environment.
>
>
>
>
>
> On 2016-01-05 18:11:45, "Meethu Mathew"  wrote:
>
> +1
> We use Python 2.7
>
> Regards,
>
> Meethu Mathew
>
> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin  wrote:
>
>> Does anybody here care about us dropping support for Python 2.6 in Spark
>> 2.0?
>>
>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>> parsing) when compared with Python 2.7. Some libraries that Spark depend on
>> stopped supporting 2.6. We can still convince the library maintainers to
>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>> Python 2.6 to run Spark.
>>
>> Thanks.
>>
>>
>>
>


Re: finding distinct count using dataframe

2016-01-05 Thread Kristina Rogale Plazonic
I think it's an expression, rather than a function you'd find in the API
 (as a function you could do   df.select(col).distinct.count)

This will give you the number of distinct rows in both columns:
scala> df.select(countDistinct("name", "age"))
res397: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT name,age): bigint]

Whereas this will give you the number of distinct values in each column:
scala> df.select(countDistinct("name"), countDistinct("age"))
res398: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT name): bigint,
COUNT(DISTINCT age): bigint]

Of course, when you need many columns at once, this expression becomes
tedious, so I find it easiest to construct an sql statement from column
names, like so:

df.registerTempTable("df")
val sqlstatement = "select "+ df.columns.map( col => s"count (distinct
$col) as ${col}_distinct").mkString(", ") + " from df"
sqlContext.sql(sqlstatement)

But this is not efficient - see this Jira ticket
and the fix.
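The same statement can also be built with the DataFrame API instead of a SQL
string (a sketch; it has the same efficiency caveat):

import org.apache.spark.sql.functions.{col, countDistinct}

// One countDistinct expression per column, all evaluated in a single agg call.
val exprs = df.columns.map(c => countDistinct(col(c)).alias(s"${c}_distinct"))
df.agg(exprs.head, exprs.tail: _*).show()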

On Tue, Jan 5, 2016 at 5:55 AM, Arunkumar Pillai 
wrote:

> Thanks Yanbo,
>
> Thanks for the help. But I'm not able to find countDistinct ot
> approxCountDistinct. function. These functions are within dataframe or any
> other package
>
> On Tue, Jan 5, 2016 at 3:24 PM, Yanbo Liang  wrote:
>
>> Hi Arunkumar,
>>
>> You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or
>> approxCountDistinct for a approximate result.
>>
>> 2016-01-05 17:11 GMT+08:00 Arunkumar Pillai :
>>
>>> Hi
>>>
>>> Is there any   functions to find distinct count of all the variables in
>>> dataframe.
>>>
>>> val sc = new SparkContext(conf) // spark context
>>> val options = Map("header" -> "true", "delimiter" -> delimiter, 
>>> "inferSchema" -> "true")
>>> val sqlContext = new org.apache.spark.sql.SQLContext(sc) // sql context
>>> val datasetDF = 
>>> sqlContext.read.format("com.databricks.spark.csv").options(options).load(inputFile)
>>>
>>>
>>> we are able to get the schema, variable data type. is there any method to 
>>> get the distinct count ?
>>>
>>>
>>>
>>> --
>>> Thanks and Regards
>>> Arun
>>>
>>
>>
>
>
> --
> Thanks and Regards
> Arun
>


Re: sparkR ORC support.

2016-01-05 Thread Sandeep Khurana
Also, do I need to setup hive in spark as per the link
http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark ?

We might need to copy hdfs-site.xml file to spark conf directory ?

On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana 
wrote:

> Deepak
>
> Tried this. Getting this error now
>
> Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") :
>   unused argument ("")
>
>
> On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma 
> wrote:
>
>> Hi Sandeep
>> can you try this ?
>>
>> results <- sql(hivecontext, "FROM test SELECT id","")
>>
>> Thanks
>> Deepak
>>
>>
>> On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana 
>> wrote:
>>
>>> Thanks Deepak.
>>>
>>> I tried this as well. I created a hivecontext   with  "hivecontext <<-
>>> sparkRHive.init(sc) "  .
>>>
>>> When I tried to read hive table from this ,
>>>
>>> results <- sql(hivecontext, "FROM test SELECT id")
>>>
>>> I get below error,
>>>
>>> Error in callJMethod(sqlContext, "sql", sqlQuery) :
>>>   Invalid jobj 2. If SparkR was restarted, Spark operations need to be 
>>> re-executed.
>>>
>>>
>>> Not sure what is causing this? Any leads or ideas? I am using rstudio.
>>>
>>>
>>>
>>> On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma 
>>> wrote:
>>>
 Hi Sandeep
 I am not sure if ORC can be read directly in R.
 But there can be a workaround .First create hive table on top of ORC
 files and then access hive table in R.

 Thanks
 Deepak

 On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana 
 wrote:

> Hello
>
> I need to read an ORC files in hdfs in R using spark. I am not able to
> find a package to do that.
>
> Can anyone help with documentation or example for this purpose?
>
> --
> Architect
> Infoworks.io
> http://Infoworks.io
>



 --
 Thanks
 Deepak
 www.bigdatabig.com
 www.keosha.net

>>>
>>>
>>>
>>> --
>>> Architect
>>> Infoworks.io
>>> http://Infoworks.io
>>>
>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>
>
>
> --
> Architect
> Infoworks.io
> http://Infoworks.io
>



-- 
Architect
Infoworks.io
http://Infoworks.io


Re: Is there a way to use parallelize function in sparkR spark version (1.6.0)

2016-01-05 Thread Ted Yu
Please take a look at the following for examples:

R/pkg/R/RDD.R
R/pkg/R/pairRDD.R

Cheers

On Tue, Jan 5, 2016 at 2:36 AM, Chandan Verma 
wrote:

>
> ===
> DISCLAIMER: The information contained in this message (including any
> attachments) is confidential and may be privileged. If you have received it
> by mistake please notify the sender by return e-mail and permanently delete
> this message and any attachments from your system. Any dissemination, use,
> review, distribution, printing or copying of this message in whole or in
> part is strictly prohibited. Please note that e-mails are susceptible to
> change. CitiusTech shall not be liable for the improper or incomplete
> transmission of the information contained in this communication nor for any
> delay in its receipt or damage to your system. CitiusTech does not
> guarantee that the integrity of this communication has been maintained or
> that this communication is free of viruses, interceptions or interferences.
> 
>
>


Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Yiannis Gkoufas
Hi Dean,

thanks so much for the response! It works without a problem now!

On 5 January 2016 at 14:33, Dean Wampler  wrote:

> ConcurrentHashMap.keySet() returning a KeySetView is a Java 8 method. The
> Java 7 method returns a Set. Are you running Java 7? What happens if you
> run Java 8?
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Tue, Jan 5, 2016 at 8:29 AM, Yiannis Gkoufas 
> wrote:
>
>> Hi there,
>>
>> I have been using Spark 1.5.2 on my cluster without a problem and wanted
>> to try Spark 1.6.0.
>> I have the exact same configuration on both clusters.
>> I am able to start the Standalone Cluster but I fail to submit a job
>> getting errors like the following:
>>
>> 16/01/05 14:24:14 INFO AppClient$ClientEndpoint: Connecting to master
>> spark://my-ip:7077...
>> 16/01/05 14:24:34 INFO AppClient$ClientEndpoint: Connecting to master
>> spark://my-ip:7077...
>> 16/01/05 14:24:54 INFO AppClient$ClientEndpoint: Connecting to master
>> spark://my-ip:7077...
>> 16/01/05 14:24:54 INFO AppClient$ClientEndpoint: Connecting to master
>> spark://my-ip:7077...
>> 16/01/05 14:24:54 WARN TransportChannelHandler: Exception in connection
>> from my-ip/X.XXX.XX.XX:7077
>> java.lang.NoSuchMethodError:
>> java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
>> at org.apache.spark.rpc.netty.Dispatcher.postToAll(Dispatcher.scala:106)
>> at
>> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:586)
>> at
>> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:577)
>> at
>> org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:170)
>> at
>> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:104)
>> at
>> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
>> at
>> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>> at
>> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>> at
>> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>> at
>> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>> at
>> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>> at
>> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>> at
>> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>> at
>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Has anyone else had similar problems?
>>
>> Thanks a lot
>>
>>
>


Handling futures from foreachPartitionAsync in Spark Streaming

2016-01-05 Thread Trevor
Hi everyone,

I'm working on a spark streaming program where I need to asynchronously
apply a complex function across the partitions of an RDD. I'm currently
using foreachPartitionAsync to achieve this. What is the idiomatic way of
handling the FutureAction returned by the foreachPartitionAsync call?
Currently I am simply doing:

try {
  Await.ready(future, timeout)
} catch {
  case error: TimeoutException =>
  future.cancel()
  //log the error
}

Is there a better way to handle the possibility of a future timeout? I would
prefer some method of retrying but am not sure how that would work in the
Spark Streaming execution model. Processing order isn't particularly
important to me, so the ability to "come back at a later time" and retry the
batch interval contents would be helpful.
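Something like this bounded-retry wrapper is roughly what I have in mind (just
a sketch; runWithRetries, maxRetries and the timeout are made-up names and
parameters):

import java.util.concurrent.TimeoutException
import scala.concurrent.Await
import scala.concurrent.duration.Duration
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def runWithRetries[T: ClassTag](rdd: RDD[T], timeout: Duration, maxRetries: Int)
                               (f: Iterator[T] => Unit): Unit = {
  var attempt = 0
  var done = false
  while (!done && attempt <= maxRetries) {
    val future = rdd.foreachPartitionAsync(f)   // kicks off the async job again on retry
    try {
      Await.ready(future, timeout)
      done = true
    } catch {
      case _: TimeoutException =>
        future.cancel()                         // cancel the timed-out job before retrying
        attempt += 1
    }
  }
}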

Thanks for any advice!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Handling-futures-from-foreachPartitionAsync-in-Spark-Streaming-tp25883.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Dean Wampler
ConcurrentHashMap.keySet() returning a KeySetView is a Java 8 method. The
Java 7 method returns a Set. Are you running Java 7? What happens if you
run Java 8?

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Tue, Jan 5, 2016 at 8:29 AM, Yiannis Gkoufas 
wrote:

> Hi there,
>
> I have been using Spark 1.5.2 on my cluster without a problem and wanted
> to try Spark 1.6.0.
> I have the exact same configuration on both clusters.
> I am able to start the Standalone Cluster but I fail to submit a job
> getting errors like the following:
>
> 16/01/05 14:24:14 INFO AppClient$ClientEndpoint: Connecting to master
> spark://my-ip:7077...
> 16/01/05 14:24:34 INFO AppClient$ClientEndpoint: Connecting to master
> spark://my-ip:7077...
> 16/01/05 14:24:54 INFO AppClient$ClientEndpoint: Connecting to master
> spark://my-ip:7077...
> 16/01/05 14:24:54 INFO AppClient$ClientEndpoint: Connecting to master
> spark://my-ip:7077...
> 16/01/05 14:24:54 WARN TransportChannelHandler: Exception in connection
> from my-ip/X.XXX.XX.XX:7077
> java.lang.NoSuchMethodError:
> java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
> at org.apache.spark.rpc.netty.Dispatcher.postToAll(Dispatcher.scala:106)
> at
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:586)
> at
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:577)
> at
> org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:170)
> at
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:104)
> at
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
> at
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
> at
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> at
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> at
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> at
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> at
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
> at
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
>
> Has anyone else had similar problems?
>
> Thanks a lot
>
>


Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Mark Hamstra
I don't understand.  If you're using fair scheduling and don't set a pool,
the default pool will be used.

On Tue, Jan 5, 2016 at 1:57 AM, Jeff Zhang  wrote:

>
> It seems currently spark.scheduler.pool must be set as localProperties
> (associate with thread). Any reason why spark.scheduler.pool can not be
> used globally.  My scenario is that I want my thriftserver started with
> fair scheduler as the default pool without using set command to set the
> pool. Is there anyway to do that ? Or do I miss anything here ?
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
RHEL/CentOS 6 ships with Python 2.6, doesn't it?

If so, I still know plenty of large companies where Python 2.6 is the only
option. Asking them for Python 2.7 is not going to work.

So I think it's a bad idea.

On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland 
wrote:

> I don't see a reason Spark 2.0 would need to support Python 2.6. At this
> point, Python 3 should be the default that is encouraged.
> Most organizations acknowledge the 2.7 is common, but lagging behind the
> version they should theoretically use. Dropping python 2.6
> support sounds very reasonable to me.
>
> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> +1
>>
>> Red Hat supports Python 2.6 on REHL 5 until 2020
>> , but
>> otherwise yes, Python 2.6 is ancient history and the core Python developers
>> stopped supporting it in 2013. REHL 5 is not a good enough reason to
>> continue support for Python 2.6 IMO.
>>
>> We should aim to support Python 2.7 and Python 3.3+ (which I believe we
>> currently do).
>>
>> Nick
>>
>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang  wrote:
>>
>>> plus 1,
>>>
>>> we are currently using python 2.7.2 in production environment.
>>>
>>>
>>>
>>>
>>>
>>> On 2016-01-05 18:11:45, "Meethu Mathew"  wrote:
>>>
>>> +1
>>> We use Python 2.7
>>>
>>> Regards,
>>>
>>> Meethu Mathew
>>>
>>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin 
>>> wrote:
>>>
 Does anybody here care about us dropping support for Python 2.6 in
 Spark 2.0?

 Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
 parsing) when compared with Python 2.7. Some libraries that Spark depend on
 stopped supporting 2.6. We can still convince the library maintainers to
 support 2.6, but it will be extra work. I'm curious if anybody still uses
 Python 2.6 to run Spark.

 Thanks.



>>>
>


Re: Negative Number of Active Tasks in Spark UI

2016-01-05 Thread Shixiong(Ryan) Zhu
Did you enable "spark.speculation"?

On Tue, Jan 5, 2016 at 9:14 AM, Prasad Ravilla  wrote:

> I am using Spark 1.5.2.
>
> I am not using Dynamic allocation.
>
> Thanks,
> Prasad.
>
>
>
>
> On 1/5/16, 3:24 AM, "Ted Yu"  wrote:
>
> >Which version of Spark do you use ?
> >
> >This might be related:
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_SPARK-2D8560=CwICAg=fa_WZs7nNMvOIDyLmzi2sMVHyyC4hN9WQl29lWJQ5Y4=-5JY3iMOXXyFuBleKruCQ-6rGWyZEyiHu8ySSzJdEHw=4v0Ji1ymhcVi2Ys2mzOne0cuiDxWMiYmeRYVUeF3hWU=9L2ltekpwnC0BDcJPW43_ctNL_G4qTXN4EY2H_Ys0nU=
> >
> >Do you use dynamic allocation ?
> >
> >Cheers
> >
> >> On Jan 4, 2016, at 10:05 PM, Prasad Ravilla  wrote:
> >>
> >> I am seeing negative active tasks in the Spark UI.
> >>
> >> Is anyone seeing this?
> >> How is this possible?
> >>
> >> Thanks,
> >> Prasad.
> >> 
> >> 
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Shixiong(Ryan) Zhu
Hey Rachana,

There are actually two jobs in your code: `rdd.isEmpty` and
`rdd.saveAsTextFile`. Since you don't cache or checkpoint this RDD, it will
execute your map function twice for each record.

You can move "accum.add(1)" to "rdd.saveAsTextFile" like this:

JavaDStream<String> lines = messages.map(
    new Function<Tuple2<String, String>, String>() {
      public String call(Tuple2<String, String> tuple2) {
        LOG.info("#  Input json stream data  # " + tuple2._2);
        return tuple2._2();
      }
    });
lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
  public Void call(JavaRDD<String> rdd) throws Exception {
    if (!rdd.isEmpty() || !rdd.partitions().isEmpty()) {
      rdd.map(new Function<String, String>() {
        public String call(String str) {
          accum.add(1);
          return str;
        }
      }).saveAsTextFile("hdfs://quickstart.cloudera:8020/user/cloudera/testDirJan4/test1.text");
    }
    System.out.println(" & COUNT OF ACCUMULATOR IS " + accum.value());
    return null;
  }
});



On Tue, Jan 5, 2016 at 8:37 AM, Rachana Srivastava <
rachana.srivast...@markmonitor.com> wrote:

> Thanks a lot for your prompt response.  I am pushing one message.
>
>
>
> HashMap<String, String> kafkaParams = new HashMap<String, String>();
> kafkaParams.put("metadata.broker.list", "localhost:9092");
> kafkaParams.put("zookeeper.connect", "localhost:2181");
>
> JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
>     jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
>     kafkaParams, topicsSet);
>
> final Accumulator<Integer> accum = jssc.sparkContext().accumulator(0);
>
> JavaDStream<String> lines = messages.map(
>     new Function<Tuple2<String, String>, String>() {
>       public String call(Tuple2<String, String> tuple2) {
>         LOG.info("#  Input json stream data  # " + tuple2._2);
>         accum.add(1);
>         return tuple2._2();
>       }
>     });
>
> lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
>   public Void call(JavaRDD<String> rdd) throws Exception {
>     if (!rdd.isEmpty() || !rdd.partitions().isEmpty()) {
>       rdd.saveAsTextFile("hdfs://quickstart.cloudera:8020/user/cloudera/testDirJan4/test1.text");
>     }
>     System.out.println(" & COUNT OF ACCUMULATOR IS " + accum.value());
>     return null;
>   }
> });
>
> jssc.start();
>
>
>
> If I remove this saveAsTextFile I get correct count with this line I am
> getting double counting.
>
>
>
> Here is the stack trace with the saveAsTextFile statement. Please see the double
> counting below:
>
>
>
> &&& BEFORE COUNT OF ACCUMULATOR IS &&& 0
>
> INFO : org.apache.spark.SparkContext - Starting job: foreachRDD at
> KafkaURLStreaming.java:90
>
> INFO : org.apache.spark.scheduler.DAGScheduler - Got job 0 (foreachRDD at
> KafkaURLStreaming.java:90) with 1 output partitions
>
> INFO : org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage
> 0(foreachRDD at KafkaURLStreaming.java:90)
>
> INFO : org.apache.spark.scheduler.DAGScheduler - Parents of final stage:
> List()
>
> INFO : org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
>
> INFO : org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 0
> (MapPartitionsRDD[1] at map at KafkaURLStreaming.java:83), which has no
> missing parents
>
> INFO : org.apache.spark.storage.MemoryStore - ensureFreeSpace(3856) called
> with curMem=0, maxMem=1893865881
>
> INFO : org.apache.spark.storage.MemoryStore - Block broadcast_0 stored as
> values in memory (estimated size 3.8 KB, free 1806.1 MB)
>
> INFO : org.apache.spark.storage.MemoryStore - ensureFreeSpace(2225) called
> with curMem=3856, maxMem=1893865881
>
> INFO : org.apache.spark.storage.MemoryStore - Block broadcast_0_piece0
> stored as bytes in memory (estimated size 2.2 KB, free 1806.1 MB)
>
> INFO : org.apache.spark.storage.BlockManagerInfo - Added
> broadcast_0_piece0 in memory on localhost:51637 (size: 2.2 KB, free: 1806.1
> MB)
>
> INFO : org.apache.spark.SparkContext - Created broadcast 0 from broadcast
> at DAGScheduler.scala:861
>
> INFO : org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing
> tasks from ResultStage 0 (MapPartitionsRDD[1] at map at
> KafkaURLStreaming.java:83)
>
> INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 0.0
> with 1 tasks
>
> INFO : org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in
> stage 0.0 (TID 0, localhost, ANY, 2026 bytes)
>
> INFO : org.apache.spark.executor.Executor - Running task 0.0 in stage 0.0
> (TID 0)
>
> INFO : org.apache.spark.streaming.kafka.KafkaRDD - Computing topic test11,
> partition 0 offsets 36 -> 37
>
> INFO : kafka.utils.VerifiableProperties - Verifying properties
>
> INFO : kafka.utils.VerifiableProperties - Property fetch.message.max.bytes
> is 

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Ted Yu
+1

> On Jan 5, 2016, at 10:49 AM, Davies Liu  wrote:
> 
> +1
> 
> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas
>  wrote:
>> +1
>> 
>> Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python
>> 2.6 is ancient history and the core Python developers stopped supporting it
>> in 2013. RHEL 5 is not a good enough reason to continue support for Python
>> 2.6 IMO.
>> 
>> We should aim to support Python 2.7 and Python 3.3+ (which I believe we
>> currently do).
>> 
>> Nick
>> 
>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang  wrote:
>>> 
>>> plus 1,
>>> 
>>> we are currently using python 2.7.2 in production environment.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 2016-01-05 18:11:45, "Meethu Mathew"  wrote:
>>> 
>>> +1
>>> We use Python 2.7
>>> 
>>> Regards,
>>> 
>>> Meethu Mathew
>>> 
 On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin  wrote:
 
 Does anybody here care about us dropping support for Python 2.6 in Spark
 2.0?
 
 Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
 parsing) when compared with Python 2.7. Some libraries that Spark depend on
 stopped supporting 2.6. We can still convince the library maintainers to
 support 2.6, but it will be extra work. I'm curious if anybody still uses
 Python 2.6 to run Spark.
 
 Thanks.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Michael Armbrust
You could try with the `Encoders.bean` method.  It detects classes that
have getters and setters.  Please report back!
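For reference, a minimal Scala sketch of that suggestion (MyRecord stands in for a hypothetical Avro-generated bean and the spark-avro data source is assumed to be on the classpath; whether bean encoders cover every Avro-generated type is exactly what is worth reporting back):

import org.apache.spark.sql.Encoders

// MyRecord is a hypothetical Avro-generated class with bean-style getters/setters.
implicit val recordEncoder = Encoders.bean(classOf[MyRecord])

val ds = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("/path/to/avro")
  .as[MyRecord]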

On Tue, Jan 5, 2016 at 9:45 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi everyone,
> considering the new Datasets API, will there be Encoders defined for
> reading and writing Avro files ? Will it be possible to use already
> generated Avro classes ?
>
> Regards,
>
> --
> *Olivier Girardot*
>


Spark on Apache Ingnite?

2016-01-05 Thread unk1102
Hi, has anybody tried and had success with Spark on Apache Ignite? It seems
promising: https://ignite.apache.org/



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Apache-Ingnite-tp25884.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
+1

On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas
 wrote:
> +1
>
> Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python
> 2.6 is ancient history and the core Python developers stopped supporting it
> in 2013. RHEL 5 is not a good enough reason to continue support for Python
> 2.6 IMO.
>
> We should aim to support Python 2.7 and Python 3.3+ (which I believe we
> currently do).
>
> Nick
>
> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang  wrote:
>>
>> plus 1,
>>
>> we are currently using python 2.7.2 in production environment.
>>
>>
>>
>>
>>
>> On 2016-01-05 18:11:45, "Meethu Mathew"  wrote:
>>
>> +1
>> We use Python 2.7
>>
>> Regards,
>>
>> Meethu Mathew
>>
>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin  wrote:
>>>
>>> Does anybody here care about us dropping support for Python 2.6 in Spark
>>> 2.0?
>>>
>>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>>> parsing) when compared with Python 2.7. Some libraries that Spark depend on
>>> stopped supporting 2.6. We can still convince the library maintainers to
>>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>>> Python 2.6 to run Spark.
>>>
>>> Thanks.
>>>
>>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-05 Thread Andy Davidson
Hi Michael

I am not sure you understood my code correctly.

I am trying to implement  org.apache.spark.ml.Transformer interface in Java
8.


My understanding is the pseudo code for transformers is something like
@Override

public DataFrame transform(DataFrame df) {

1. Select the input column

2. Create a new column

3. Append the new column to the df argument and return

   }



Based on my experience the current DataFrame API is very limited. You cannot
apply a complicated lambda function. As a workaround I convert the data
frame to a JavaRDD, apply my complicated lambda, and then convert the
resulting RDD back to a DataFrame.



Now I select the "new column" from the DataFrame and try to call
df.withColumn().



I can try to implement this as a UDF. However, I need to use several 3rd
party jars. Any idea how to ensure the workers will have the required jar
files? If I was submitting a normal Java app I would create an uber jar; will
this work with UDFs?



Kind regards



Andy



From:  Michael Armbrust 
Date:  Monday, January 4, 2016 at 11:14 PM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: problem with DataFrame df.withColumn()
org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

> Its not really possible to convert an RDD to a Column.  You can think of a
> Column as an expression that produces a single output given some set of input
> columns.  If I understand your code correctly, I think this might be easier to
> express as a UDF:
> sqlContext.udf().register("stem", new UDF1<String, String>() {
>   @Override
>   public String call(String str) {
> return // TODO: stemming code here
>   }
> }, DataTypes.StringType);
> DataFrame transformed = df.withColumn("filteredInput",
> expr("stem(rawInput)"));
> 
> On Mon, Jan 4, 2016 at 8:08 PM, Andy Davidson 
> wrote:
>> I am having a heck of a time writing a simple transformer in Java. I assume
>> that my Transformer is supposed to append a new column to the dataFrame
>> argument. Any idea why I get the following exception in Java 8 when I try to
>> call DataFrame withColumn()? The JavaDoc says withColumn() "Returns a new
>> DataFrame by adding a column or replacing the existing column that has the
>> same name."
>> 
>> 
>> Also do transformers always run in the driver? If not I assume workers do not
>> have the sqlContext. Any idea how I can convert a JavaRDD<> to a Column
>> without a sqlContext?
>> 
>> Kind regards
>> 
>> Andy
>> 
>> P.s. I am using spark 1.6.0
>> 
>> org.apache.spark.sql.AnalysisException: resolved attribute(s)
>> filteredOutput#1 missing from rawInput#0 in operator !Project
>> [rawInput#0,filteredOutput#1 AS filteredOutput#2];
>> at 
>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(Check
>> Analysis.scala:38)
>> at 
>> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:4
>> 4)
>> at 
>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1
>> .apply(CheckAnalysis.scala:183)
>> at 
>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1
>> .apply(CheckAnalysis.scala:50)
>> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>> at 
>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(Chec
>> kAnalysis.scala:50)
>> at 
>> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:
>> 44)
>> at 
>> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.s
>> cala:34)
>> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
>> at org.apache.spark.sql.DataFrame.org
>> 
>> $apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
>> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:751)
>> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1227)
>> at com.pws.poc.ml.StemmerTransformer.transform(StemmerTransformer.java:110)
>> at com.pws.poc.ml.StemmerTransformerTest.test(StemmerTransformerTest.java:45)
>> 
>> 
>> 
>> public class StemmerTransformer extends Transformer implements Serializable {
>> 
>>String inputCol; // unit test sets to rawInput
>>String outputCol; // unit test sets to filteredOutput
>> 
>>   …
>> 
>> 
>>   public StemmerTransformer(SQLContext sqlContext) {
>> 
>> // will only work if transformers execute in the driver
>> 
>> this.sqlContext = sqlContext;
>> 
>> }
>> 
>> 
>>  @Override
>> 
>> public DataFrame transform(DataFrame df) {
>> 
>> df.printSchema();
>> 
>> df.show();
>> 
>> 
>> 
>> JavaRDD<Row> inRowRDD = df.select(inputCol).javaRDD();
>> 
>> JavaRDD<Row> outRowRDD = inRowRDD.map((Row row) -> {
>> 
>> // TODO add stemming code
>> 
>> // Create a new Row
>> 

spark-itemsimilarity No FileSystem for scheme error

2016-01-05 Thread roy
Hi we are using CDH 5.4.0 with Spark 1.5.2 (doesn't come with CDH 5.4.0)


I am following this link
https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html to
try to test/create a new algorithm with Mahout item-similarity.

I am running following command 

./bin/mahout spark-itemsimilarity \
--input $INPUT \
--output $OUTPUT \
--filter1 o --filter2 v \
--inDelim "\t" \
 --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1 \
 --master yarn-client \
 -D:fs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem \
 -D:fs.file.impl=org.apache.hadoop.fs.LocalFileSystem

I am getting following error  
 
java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2385)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
at org.apache.spark.deploy.yarn.Client.cleanupStagingDir(Client.scala:143)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:129)
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:523)
at
org.apache.mahout.sparkbindings.package$.mahoutSparkContext(package.scala:91)
at
org.apache.mahout.drivers.MahoutSparkDriver.start(MahoutSparkDriver.scala:83)
at
org.apache.mahout.drivers.ItemSimilarityDriver$.start(ItemSimilarityDriver.scala:118)
at
org.apache.mahout.drivers.ItemSimilarityDriver$.process(ItemSimilarityDriver.scala:199)
at
org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:112)
at
org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:110)
at scala.Option.map(Option.scala:145)
at
org.apache.mahout.drivers.ItemSimilarityDriver$.main(ItemSimilarityDriver.scala:110)
at
org.apache.mahout.drivers.ItemSimilarityDriver.main(ItemSimilarityDriver.scala)


I found a solution by adding the following properties into
/etc/hadoop/conf/core-site.xml on the client/gateway machine:


<property>
  <name>fs.file.impl</name>
  <value>org.apache.hadoop.fs.LocalFileSystem</value>
  <description>The FileSystem for file: uris.</description>
</property>

<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>

 But is there any better way to solve this error ?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-itemsimilarity-No-FileSystem-for-scheme-error-tp25887.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: sparkR ORC support.

2016-01-05 Thread Sandeep Khurana
Deepak

Tried this. Getting this error now

Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") :
  unused argument ("")


On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma  wrote:

> Hi Sandeep
> can you try this ?
>
> results <- sql(hivecontext, "FROM test SELECT id","")
>
> Thanks
> Deepak
>
>
> On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana 
> wrote:
>
>> Thanks Deepak.
>>
>> I tried this as well. I created a hivecontext   with  "hivecontext <<-
>> sparkRHive.init(sc) "  .
>>
>> When I tried to read hive table from this ,
>>
>> results <- sql(hivecontext, "FROM test SELECT id")
>>
>> I get below error,
>>
>> Error in callJMethod(sqlContext, "sql", sqlQuery) :
>>   Invalid jobj 2. If SparkR was restarted, Spark operations need to be 
>> re-executed.
>>
>>
>> Not sure what is causing this? Any leads or ideas? I am using rstudio.
>>
>>
>>
>> On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma 
>> wrote:
>>
>>> Hi Sandeep
>>> I am not sure if ORC can be read directly in R.
>>> But there can be a workaround .First create hive table on top of ORC
>>> files and then access hive table in R.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana 
>>> wrote:
>>>
 Hello

 I need to read an ORC files in hdfs in R using spark. I am not able to
 find a package to do that.

 Can anyone help with documentation or example for this purpose?

 --
 Architect
 Infoworks.io
 http://Infoworks.io

>>>
>>>
>>>
>>> --
>>> Thanks
>>> Deepak
>>> www.bigdatabig.com
>>> www.keosha.net
>>>
>>
>>
>>
>> --
>> Architect
>> Infoworks.io
>> http://Infoworks.io
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>



-- 
Architect
Infoworks.io
http://Infoworks.io


Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Igor Berman
Another option would be to try
rdd.toLocalIterator()
though I'm not sure it will help.

I had the same problem and ended up moving all parts to local disk (with the
Hadoop FileSystem API) and then processing them locally.
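For the DataFrame code in the original post, the shuffle variant suggested below (coalesce(1, true) at the RDD level) corresponds roughly to repartition(1) on a DataFrame. A rough Scala sketch, with the output path as a placeholder and compression options omitted:

// repartition(1) shuffles first, so upstream stages keep their parallelism
// and only the final write goes through a single task.
sourceFrame
  .repartition(1)
  .write
  .format("com.databricks.spark.csv")
  .save("/path/hadoop")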


On 5 January 2016 at 22:08, Alexander Pivovarov 
wrote:

> try coalesce(1, true).
>
> On Tue, Jan 5, 2016 at 11:58 AM, unk1102  wrote:
>
>> hi I am trying to save many partitions of Dataframe into one CSV file and
>> it
>> take forever for large data sets of around 5-6 GB.
>>
>>
>> sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("gzip").save("/path/hadoop")
>>
>> For small data above code works well but for large data it hangs forever
>> does not move on because of only one partitions has to shuffle data of GBs
>> please help me
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/coalesce-1-saveAsTextfile-takes-forever-tp25886.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread unk1102
hi, I am trying to save many partitions of a DataFrame into one CSV file and it
takes forever for large data sets of around 5-6 GB.

sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("gzip").save("/path/hadoop")

For small data the above code works well, but for large data it hangs forever
and does not move on, because only one partition has to shuffle GBs of data.
Please help me.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/coalesce-1-saveAsTextfile-takes-forever-tp25886.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Alexander Pivovarov
try coalesce(1, true).

On Tue, Jan 5, 2016 at 11:58 AM, unk1102  wrote:

> hi I am trying to save many partitions of Dataframe into one CSV file and
> it
> take forever for large data sets of around 5-6 GB.
>
>
> sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("gzip").save("/path/hadoop")
>
> For small data above code works well but for large data it hangs forever
> does not move on because of only one partitions has to shuffle data of GBs
> please help me
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/coalesce-1-saveAsTextfile-takes-forever-tp25886.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Out of memory issue

2016-01-05 Thread babloo80
Hello there,

I have a Spark job that reads 7 parquet files (8 GB, 3 x 16 GB, 3 x 14 GB) in
different stages of execution and creates a result parquet of 9 GB (about 27
million rows containing 165 columns; some columns are map based, containing
at most 200-value histograms). The stages involve:
Step 1: Read the data using the DataFrame API.
Step 2: Transform the DataFrame to an RDD to perform some transformations (some
of the columns are transformed into histograms, using an empirical distribution
to cap the number of keys, and some of them run like a UDAF during the
reduce-by-key step).
Step 3: Reduce the result by key so that the resultant can be used in the
next stage for a join.
Step 4: Perform a left outer join of this result, which runs similar to Steps 1
through 3.
Step 5: The results are further reduced to be written to parquet.

With Apache Spark 1.5.2, I am able to run the job with no issues.
The current env uses 8 nodes running a total of 320 cores, with 100 GB executor
memory per node and the driver program using 32 GB. The approximate execution
time is about 1.2 hrs. The parquet files are stored in another HDFS cluster
for read and eventual write of the result.

When the same job is executed using Apache Spark 1.6.0, some of the executor
nodes' JVMs get restarted (with a new executor id). On further turning on GC
stats on the executor, the perm-gen seems to get maxed out and ends up
showing the symptom of out-of-memory.

Please advise on where to start investigating this issue.

Thanks,
Muthu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Out-of-memory-issue-tp25888.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark on Apache Ingnite?

2016-01-05 Thread Umesh Kacha
Hi Nate, thanks much. I have the exact same use cases you mentioned. My
Spark job does heavy writing involving group by and huge data shuffling.
Can you please provide any pointers on how I can make my existing Spark job,
which is running on YARN, run on Ignite? Please guide. Thanks
again.
On Jan 6, 2016 02:28,  wrote:

> We started playing with Ignite to back Hadoop, Hive and Spark services, and
> are looking to move to it as our default for deployment going forward; still
> early, but so far it's been pretty nice and we're excited for the flexibility it
> will provide for our particular use cases.
>
> Would say in general its worth looking into if your data workloads are:
>
> a) mix of read/write, or heavy write at times
> b) want write/read access to data from services/apps outside of your spark
> workloads (old Hadoop jobs, custom apps, etc)
> c) have strings of spark jobs that could benefit from caching your data
> across them (think similar usage to tachyon)
> d) you have sparksql queries that could benefit from indexing and
> mutability
> (see pt (a) about mix read/write)
>
> If your data is read exclusive and very batch oriented, and your workloads
> are strictly spark based, benefits will be less and ignite would probably
> act as more of a tachyon replacement as many of the other features outside
> of RDD caching wont be leveraged.
>
>
> -Original Message-
> From: unk1102 [mailto:umesh.ka...@gmail.com]
> Sent: Tuesday, January 5, 2016 10:15 AM
> To: user@spark.apache.org
> Subject: Spark on Apache Ingnite?
>
> Hi has anybody tried and had success with Spark on Apache Ignite seems
> promising? https://ignite.apache.org/
>
>
>
> --
> View this message in context:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Apache-Ingnite-
> tp25884.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org
>
>
>


Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Gavin Yue
I tried Ted's solution and it works. But I keep hitting the JVM out-of-memory
problem, and grouping by the key causes a lot of data shuffling.

So I am trying to order the data based on ID first and save it as Parquet. Is
there a way to make sure that the data is partitioned so that each ID's data is
in one partition, and there would be no shuffling in the future?

Thanks.


On Tue, Jan 5, 2016 at 3:19 PM, Michael Armbrust 
wrote:

> This would also be possible with an Aggregator in Spark 1.6:
>
> https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html
>
> On Tue, Jan 5, 2016 at 2:59 PM, Ted Yu  wrote:
>
>> Something like the following:
>>
>> val zeroValue = collection.mutable.Set[String]()
>>
>> val aggredated = data.aggregateByKey (zeroValue)((set, v) => set += v,
>> (setOne, setTwo) => setOne ++= setTwo)
>>
>> On Tue, Jan 5, 2016 at 2:46 PM, Gavin Yue  wrote:
>>
>>> Hey,
>>>
>>> For example, a table df with two columns
>>> id  name
>>> 1   abc
>>> 1   bdf
>>> 2   ab
>>> 2   cd
>>>
>>> I want to group by the id and concat the string into array of string.
>>> like this
>>>
>>> id
>>> 1 [abc,bdf]
>>> 2 [ab, cd]
>>>
>>> How could I achieve this in dataframe?  I stuck on df.groupBy("id"). ???
>>>
>>> Thanks
>>>
>>>
>>
>


Re: How to accelerate reading json file?

2016-01-05 Thread VISHNU SUBRAMANIAN
Hi,

You can try this

sqlContext.read.format("json").option("samplingRatio","0.1").load("path")

If it still takes time, feel free to experiment with the samplingRatio.

Thanks,
Vishnu
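Since the original poster already knows the schema, another option worth sketching is to pass an explicit schema, which skips inference entirely (the field names below are hypothetical):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

// read.schema(...) avoids the full pass over the data that schema inference does
val people = sqlContext.read.schema(schema).json(path)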

On Wed, Jan 6, 2016 at 12:43 PM, Gavin Yue  wrote:

> I am trying to read json files following the example:
>
> val path = "examples/src/main/resources/jsonfile"val people = 
> sqlContext.read.json(path)
>
> I have 1 Tb size files in the path.  It took 1.2 hours to finish the reading 
> to infer the schema.
>
> But I already know the schema. Could I make this process short?
>
> Thanks a lot.
>
>
>
>


Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Gavin Yue
I found that in 1.6 a DataFrame can do repartition.

Do I still need to do orderBy first, or do I just have to repartition?
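A small Scala sketch of the 1.6 API mentioned here (the column name id follows the example below; sortWithinPartitions is only needed if ordering inside each partition also matters, and whether a later operation can actually skip its shuffle depends on that operation):

import sqlContext.implicits._

val byId = df
  .repartition($"id")               // co-locates all rows with the same id in one partition
  .sortWithinPartitions($"id")      // optional: order rows inside each partition

byId.write.parquet("/path/output")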




On Tue, Jan 5, 2016 at 9:25 PM, Gavin Yue  wrote:

> I tried the Ted's solution and it works.   But I keep hitting the JVM out
> of memory problem.
> And grouping the key causes a lot of  data shuffling.
>
> So I am trying to order the data based on ID first and save as Parquet.
> Is there way to make sure that the data is partitioned that each ID's data
> is in one partition, so there would be no shuffling in the future?
>
> Thanks.
>
>
> On Tue, Jan 5, 2016 at 3:19 PM, Michael Armbrust 
> wrote:
>
>> This would also be possible with an Aggregator in Spark 1.6:
>>
>> https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html
>>
>> On Tue, Jan 5, 2016 at 2:59 PM, Ted Yu  wrote:
>>
>>> Something like the following:
>>>
>>> val zeroValue = collection.mutable.Set[String]()
>>>
>>> val aggredated = data.aggregateByKey (zeroValue)((set, v) => set += v,
>>> (setOne, setTwo) => setOne ++= setTwo)
>>>
>>> On Tue, Jan 5, 2016 at 2:46 PM, Gavin Yue 
>>> wrote:
>>>
 Hey,

 For example, a table df with two columns
 id  name
 1   abc
 1   bdf
 2   ab
 2   cd

 I want to group by the id and concat the string into array of string.
 like this

 id
 1 [abc,bdf]
 2 [ab, cd]

 How could I achieve this in dataframe?  I stuck on df.groupBy("id"). ???

 Thanks


>>>
>>
>


Re: 101 question on external metastore

2016-01-05 Thread Yana Kadiyska
Deenar, I have not resolved this issue. Why do you think it's from
different versions of Derby? I was playing with this as a fun experiment
and my setup was on a clean machine -- no other versions of
hive/hadoop/etc...

On Sun, Dec 20, 2015 at 12:17 AM, Deenar Toraskar  wrote:

> apparently it is down to different versions of derby in the classpath, but
> i am unsure where the other version is coming from. The setup worked
> perfectly with spark 1.3.1.
>
> Deenar
>
> On 20 December 2015 at 04:41, Deenar Toraskar 
> wrote:
>
>> Hi Yana/All
>>
>> I am getting the same exception. Did you make any progress?
>>
>> Deenar
>>
>> On 5 November 2015 at 17:32, Yana Kadiyska 
>> wrote:
>>
>>> Hi folks, trying experiment with a minimal external metastore.
>>>
>>> I am following the instructions here:
>>> https://cwiki.apache.org/confluence/display/Hive/HiveDerbyServerMode
>>>
>>> I grabbed Derby 10.12.1.1 and started an instance, verified I can
>>> connect via ij tool and that process is listening on 1527
>>>
>>> put the following hive-site.xml under conf
>>> ```
>>> 
>>> 
>>> 
>>> 
>>> <property>
>>>   <name>javax.jdo.option.ConnectionURL</name>
>>>   <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
>>>   <description>JDBC connect string for a JDBC metastore</description>
>>> </property>
>>> <property>
>>>   <name>javax.jdo.option.ConnectionDriverName</name>
>>>   <value>org.apache.derby.jdbc.ClientDriver</value>
>>>   <description>Driver class name for a JDBC metastore</description>
>>> </property>
>>> 
>>> ```
>>>
>>> I then try to run spark-shell thusly:
>>> bin/spark-shell --driver-class-path
>>> /home/yana/db-derby-10.12.1.1-bin/lib/derbyclient.jar
>>>
>>> and I get an ugly stack trace like so...
>>>
>>> Caused by: java.lang.NoClassDefFoundError: Could not initialize class
>>> org.apache.derby.jdbc.EmbeddedDriver
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>> at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>> at java.lang.Class.newInstance(Class.java:379)
>>> at
>>> org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
>>> at
>>> org.datanucleus.store.rdbms.connectionpool.DBCPConnectionPoolFactory.createConnectionPool(DBCPConnectionPoolFactory.java:50)
>>> at
>>> org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
>>> at
>>> org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
>>> at
>>> org.datanucleus.store.rdbms.ConnectionFactoryImpl.<init>(ConnectionFactoryImpl.java:85)
>>> ... 114 more
>>>
>>> <console>:10: error: not found: value sqlContext
>>>import sqlContext.implicits._
>>>
>>>
>>> What am I doing wrong -- not sure why it's looking for Embedded
>>> anything, I'm specifically trying to not use the embedded server...but I
>>> know my hive-site is being read as starting witout --driver-class-path does
>>> say it can't load org.apache.derby.jdbc.ClientDriver
>>>
>>
>>
>


Re: pyspark Dataframe and histogram through ggplot (python)

2016-01-05 Thread Felix Cheung
Hi,
select() returns a new Spark DataFrame; I would imagine ggplot would not work 
with it. Could you try df.select("something").toPandas()?


_
From: Snehotosh Banerjee 
Sent: Tuesday, January 5, 2016 4:32 AM
Subject: pyspark Dataframe and histogram through ggplot (python)
To:  


Hi,

I am facing an issue in rendering charts through ggplot while working on a
pyspark DataFrame on a dummy dataset. I have created a Spark DataFrame and am
trying to draw a histogram through ggplot in Python.

I have a valid schema as below, but the below command is not working:

ggplot(df_no_null,
aes(df_no_null.select('total_price_excluding_optional_support'))) +
geom_histogram()

Appreciate input.

Regards,
Snehotosh


  

RE: aggregateByKey vs combineByKey

2016-01-05 Thread LINChen
Hi Marco,

In your case, since you don't need to perform an aggregation (such as a sum or
average) over each key, using groupByKey may perform better. groupByKey
inherently utilizes CompactBuffer, which is much more efficient than
ArrayBuffer.

Thanks,
LIN Chen
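A one-line Scala sketch of that suggestion against the kv RDD from the question below:

val grouped = kv.groupByKey().mapValues(_.toList).collect()
// e.g. Array((1,List(i, a)), (2,List(Hi, am)), (4,List(test)), (6,List(string)))
// (element order inside each list may differ)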

Date: Tue, 5 Jan 2016 21:13:40 +
Subject: aggregateByKey vs combineByKey
From: mmistr...@gmail.com
To: user@spark.apache.org

Hi all
 i have the following dataSet
 kv = [(2,Hi), (1,i), (2,am), (1,a), (4,test), (6,s
tring)]

It's a simple list of tuples containing (word_length, word)

What i wanted to do was to group the result by key in order to have a result in 
the form

[(word_length_1, [word1, word2, word3], word_length_2, [word4, word5, word6])

so i browsed spark API and was able to get the result i wanted using two 
different
functions
.
scala> kv.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[St
ring], y:List[String]) => x ::: y).collect()
res86: Array[(Int, List[String])] = Array((1,List(i, a)), (2,List(Hi, am)), (4,L
ist(test)), (6,List(string)))

and

scala>
scala> kv.aggregateByKey(List[String]())((acc, item) => item :: acc,
 |(acc1, acc2) => acc1 ::: acc2).collect()
 
 
 
res87: Array[(Int, List[String])] = Array((1,List(i, a)), (2,List(Hi, am)), (4,L
ist(test)), (6,List(string)))

Now, question is: any advantages of using one instead of the others?
Am i somehow misusing the API for what i want to do?

kind regards
 marco







  

How to accelerate reading json file?

2016-01-05 Thread Gavin Yue
I am trying to read json files following the example:

val path = "examples/src/main/resources/jsonfile"val people =
sqlContext.read.json(path)

I have 1 TB of files in the path.  It took 1.2 hours to finish the
reading, just to infer the schema.

But I already know the schema. Could I make this process short?

Thanks a lot.


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Jim Lohse
Hey Python 2.6 don't let the door hit you on the way out! haha Drop It 
No Problem


On 01/05/2016 12:17 AM, Reynold Xin wrote:
Does anybody here care about us dropping support for Python 2.6 in 
Spark 2.0?


Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json 
parsing) when compared with Python 2.7. Some libraries that Spark 
depend on stopped supporting 2.6. We can still convince the library 
maintainers to support 2.6, but it will be extra work. I'm curious if 
anybody still uses Python 2.6 to run Spark.


Thanks.





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: pyspark dataframe: row with a minimum value of a column for each group

2016-01-05 Thread ayan guha
Yes there is. It is called window function over partitions.

Equivalent SQL would be:

select * from
 (select a,b,c, rank() over (partition by a order by b) r from df) x
where r = 1

You can register your DF as a temp table and use the SQL form. Or, in Spark
1.4+, you can use window methods and their variants in the Spark SQL module.

HTH
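The same query expressed with the DataFrame window API, sketched in Scala for illustration (column names follow the example below; rank() keeps ties, so a row-number style function would be needed to force exactly one row per group):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, col}

val w = Window.partitionBy("a").orderBy("b")

val minPerGroup = df
  .withColumn("r", rank().over(w))
  .where(col("r") === 1)
  .drop("r")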

On Wed, Jan 6, 2016 at 11:56 AM, Wei Chen  wrote:

> Hi,
>
> I am trying to retrieve the rows with a minimum value of a column for each
> group. For example: the following dataframe:
>
> a | b | c
> --
> 1 | 1 | 1
> 1 | 2 | 2
> 1 | 3 | 3
> 2 | 1 | 4
> 2 | 2 | 5
> 2 | 3 | 6
> 3 | 1 | 7
> 3 | 2 | 8
> 3 | 3 | 9
> --
>
> I group by 'a', and want the rows with the smallest 'b', that is, I want
> to return the following dataframe:
>
> a | b | c
> --
> 1 | 1 | 1
> 2 | 1 | 4
> 3 | 1 | 7
> --
>
> The dataframe I have is huge so get the minimum value of b from each group
> and joining on the original dataframe is very expensive. Is there a better
> way to do this?
>
>
> Thanks,
> Wei
>
>


-- 
Best Regards,
Ayan Guha


Re: sparkR ORC support.

2016-01-05 Thread Felix Cheung
Firstly, I don't have ORC data to verify, but this should work:

df <- loadDF(sqlContext, "data/path", "orc")

Secondly, could you check if sparkR.stop() was called? sparkRHive.init() should
be called after sparkR.init() - please check if there is any error message
there.


_
From: Prem Sure 
Sent: Tuesday, January 5, 2016 8:12 AM
Subject: Re: sparkR ORC support.
To: Sandeep Khurana 
Cc: spark users , Deepak Sharma 


Yes Sandeep, also copy hive-site.xml too to spark conf directory.

On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana  wrote:

Also, do I need to setup hive in spark as per the link
http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark ?

We might need to copy hdfs-site.xml file to spark conf directory ?

On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana  wrote:

Deepak

Tried this. Getting this error now

Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") :
  unused argument ("")

On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma  wrote:

Hi Sandeep
can you try this ?

results <- sql(hivecontext, "FROM test SELECT id","")

Thanks
Deepak

On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana  wrote:

Thanks Deepak.

I tried this as well. I created a hivecontext with "hivecontext <<- sparkRHive.init(sc)".

When I tried to read hive table from this,

results <- sql(hivecontext, "FROM test SELECT id")

I get below error,

Error in callJMethod(sqlContext, "sql", sqlQuery) :
  Invalid jobj 2. If SparkR was restarted, Spark operations need to be re-executed.

Not sure what is causing this? Any leads or ideas? I am using rstudio.

On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma  wrote:

Hi Sandeep
I am not sure if ORC can be read directly in R.
But there can be a workaround. First create hive table on top of ORC files and
then access hive table in R.

Thanks
Deepak

On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana  wrote:
 

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Priya Ch
Yes, the fileinputstream is closed. Maybe I didn't show it in the screenshot.

As Spark implements sort-based shuffle, there is a parameter called the
maximum merge factor which decides the number of files that can be merged
at once, and this avoids too many open files. I suspect that it is
something related to this.

Can someone confirm this?

On Tue, Jan 5, 2016 at 11:19 PM, Annabel Melongo 
wrote:

> Vijay,
>
> Are you closing the fileinputstream at the end of each loop ( in.close())?
> My guess is those streams aren't close and thus the "too many open files"
> exception.
>
>
> On Tuesday, January 5, 2016 8:03 AM, Priya Ch <
> learnings.chitt...@gmail.com> wrote:
>
>
> Can some one throw light on this ?
>
> Regards,
> Padma Ch
>
> On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch 
> wrote:
>
> Chris, we are using spark 1.3.0 version. we have not set  
> spark.streaming.concurrentJobs
> this parameter. It takes the default value.
>
> Vijay,
>
>   From the tack trace it is evident that 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
> is throwing the exception. I opened the spark source code and visited the
> line which is throwing this exception i.e
>
> [image: Inline image 1]
>
> The line which is marked in red is throwing the exception. The file is
> ExternalSorter.scala in org.apache.spark.util.collection package.
>
> i went through the following blog
> http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
> and understood that there is merge factor which decide the number of
> on-disk files that could be merged. Is it some way related to this ?
>
> Regards,
> Padma CH
>
> On Fri, Dec 25, 2015 at 7:51 PM, Chris Fregly  wrote:
>
> and which version of Spark/Spark Streaming are you using?
>
> are you explicitly setting the spark.streaming.concurrentJobs to
> something larger than the default of 1?
>
> if so, please try setting that back to 1 and see if the problem still
> exists.
>
> this is a dangerous parameter to modify from the default - which is why
> it's not well-documented.
>
>
> On Wed, Dec 23, 2015 at 8:23 AM, Vijay Gharge 
> wrote:
>
> Few indicators -
>
> 1) during execution time - check total number of open files using lsof
> command. Need root permissions. If it is cluster not sure much !
> 2) which exact line in the code is triggering this error ? Can you paste
> that snippet ?
>
>
> On Wednesday 23 December 2015, Priya Ch 
> wrote:
>
> ulimit -n 65000
>
> fs.file-max = 65000 ( in etc/sysctl.conf file)
>
> Thanks,
> Padma Ch
>
> On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma  wrote:
>
> Could you share the ulimit for your setup please ?
> - Thanks, via mobile,  excuse brevity.
> On Dec 22, 2015 6:39 PM, "Priya Ch"  wrote:
>
> Jakob,
>
>Increased the settings like fs.file-max in /etc/sysctl.conf and also
> increased user limit in /etc/security/limits.conf. But still see the same
> issue.
>
> On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky 
> wrote:
>
> It might be a good idea to see how many files are open and try increasing
> the open file limit (this is done on an os level). In some application
> use-cases it is actually a legitimate need.
>
> If that doesn't help, make sure you close any unused files and streams in
> your code. It will also be easier to help diagnose the issue if you send an
> error-reproducing snippet.
>
>
>
>
>
> --
> Regards,
> Vijay Gharge
>
>
>
>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>
>
>
>
>
>


Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Mark Hamstra
The other way to do it is to build a custom version of Spark where you have
changed the value of DEFAULT_SCHEDULING_MODE -- and if you were paying
close attention, I accidentally let it slip that that is what I've done.  I
previously wrote "schedulingMode = DEFAULT_SCHEDULING_MODE -- i.e.
SchedulingMode.FAIR", but that should actually be SchedulingMode.FIFO if
you haven't changed the code:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala#L65

On Tue, Jan 5, 2016 at 5:29 PM, Jeff Zhang  wrote:

> Right, I can override the root pool in configuration file, Thanks Mark.
>
> On Wed, Jan 6, 2016 at 8:45 AM, Mark Hamstra 
> wrote:
>
>> Just configure <pool name="default"> with
>> <schedulingMode>FAIR</schedulingMode> in fairscheduler.xml (or
>> in spark.scheduler.allocation.file if you have over-riden the default name
>> for the config file.)  `buildDefaultPool()` will only build the pool named
>> "default" with the default properties (such as schedulingMode =
>> DEFAULT_SCHEDULING_MODE -- i.e. SchedulingMode.FAIR) if that pool name is
>> not already built (
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala#L90
>> ).
>>
>>
>> On Tue, Jan 5, 2016 at 4:15 PM, Jeff Zhang  wrote:
>>
>>> Sorry, I don't make it clearly. What I want is the default pool is fair
>>> scheduling. But seems if I want to use fair scheduling now, I have to set
>>> spark.scheduler.pool explicitly.
>>>
>>> On Wed, Jan 6, 2016 at 2:03 AM, Mark Hamstra 
>>> wrote:
>>>
 I don't understand.  If you're using fair scheduling and don't set a
 pool, the default pool will be used.

 On Tue, Jan 5, 2016 at 1:57 AM, Jeff Zhang  wrote:

>
> It seems currently spark.scheduler.pool must be set as localProperties
> (associate with thread). Any reason why spark.scheduler.pool can not be
> used globally.  My scenario is that I want my thriftserver started with
> fair scheduler as the default pool without using set command to set the
> pool. Is there anyway to do that ? Or do I miss anything here ?
>
> --
> Best Regards
>
> Jeff Zhang
>


>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: How to use Java8

2016-01-05 Thread Andy Davidson

Hi Sea

From:  Sea <261810...@qq.com>
Date:  Tuesday, January 5, 2016 at 6:16 PM
To:  "user @spark" 
Subject:  How to use Java8

> Hi, all
> I want to support java8, I use JDK1.8.0_65 in production environment, but
> it doesn't work. Should I build spark using jdk1.8, and set
> <java.version>1.8</java.version> in pom.xml?
> 
> java.lang.UnsupportedClassVersionError:  Unsupported major.minor version 52.



Here are some notes I wrote about how to configure my data center to use
java 8. You’ll probably need to do something like this

Your mileage may vary

Andy

Setting Java_HOME
ref: configure env vars
 
install java 8 on all nodes (master and slave)
install java 1.8 on master
$ ssh -i $KEY_FILE root@$SPARK_MASTER
# ?? how was this package download from oracle? curl?
yum install jdk-8u65-linux-x64.rpm
copy rpm to slaves and install java 1.8 on slaves
for i in `cat /root/spark-ec2/slaves`;do scp
/home/ec2-user/jdk-8u65-linux-x64.rpm $i:; done
pssh -i -h /root/spark-ec2/slaves ls -l
pssh -i -h /root/spark-ec2/slaves yum install -y jdk-8u65-linux-x64.rpm
remove rpm from slaves. It is 153M
pssh -i -h /root/spark-ec2/slaves rm jdk-8u65-linux-x64.rpm
Configure spark to use java 1.8
ref: configure env vars
 
Make a back up of of config file
cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date
+%Y-%m-%d:%H:%M:%S`

pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh
/root/spark/conf/spark-env.sh-`date +%Y-%m-%d:%H:%M:%S`

pssh -i -h /root/spark-ec2/slaves ls "/root/spark/conf/spark-env.sh*"
Edit /root/spark/conf/spark-env.sh, add
 export JAVA_HOME=/usr/java/latest
Copy spark-env.sh to slaves
pssh -i -h /root/spark-ec2/slaves grep JAVA_HOME
/root/spark/conf/spark-env.sh

for i in `cat /root/spark-ec2/slaves`;do scp /root/spark/conf/spark-env.sh
$i:/root/spark/conf/spark-env.sh; done

pssh -i -h /root/spark-ec2/slaves grep JAVA_HOME
/root/spark/conf/spark-env.sh

> 




Re: How to use Java8

2016-01-05 Thread Sea
thanks




------ Original message ------
From: "Andy Davidson";
Date: 2016-01-06 (Wed) 11:04
To: "Sea"<261810...@qq.com>; "user";
Subject: Re: How to use Java8





Hi Sea


From:  Sea <261810...@qq.com>
Date:  Tuesday, January 5, 2016 at 6:16 PM
To:  "user @spark" 
Subject:  How to use Java8



Hi, all
I want to support java8, I use JDK1.8.0_65 in production environment, but 
it doesn't work. Should I build spark using jdk1.8, and set 
<java.version>1.8</java.version> in pom.xml?


java.lang.UnsupportedClassVersionError:  Unsupported major.minor version 52.






Here are some notes I wrote about how to configure my data center to use java 
8. You'll probably need to do something like this


Your mileage may vary


Andy



Setting Java_HOME

ref: configure env vars

install java 8 on all nodes (master and slave)

install java 1.8 on master
$ ssh -i $KEY_FILE root@$SPARK_MASTER # ?? how was this package download from 
oracle? curl? yum install jdk-8u65-linux-x64.rpm 
copy rpm to slaves and install java 1.8 on slaves
for i in `cat /root/spark-ec2/slaves`;do scp 
/home/ec2-user/jdk-8u65-linux-x64.rpm $i:; done pssh -i -h 
/root/spark-ec2/slaves ls -l pssh -i -h /root/spark-ec2/slaves yum install -y 
jdk-8u65-linux-x64.rpm 
remove rpm from slaves. It is 153M
pssh -i -h /root/spark-ec2/slaves rm jdk-8u65-linux-x64.rpm 
Configure spark to use java 1.8

ref: configure env vars

Make a back up of of config file
cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date 
+%Y-%m-%d:%H:%M:%S` pssh -i -h /root/spark-ec2/slaves cp 
/root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date 
+%Y-%m-%d:%H:%M:%S` pssh -i -h /root/spark-ec2/slaves ls 
"/root/spark/conf/spark-env.sh*" 
Edit /root/spark/conf/spark-env.sh, add
 export JAVA_HOME=/usr/java/latest
Copy spark-env.sh to slaves
pssh -i -h /root/spark-ec2/slaves grep JAVA_HOME /root/spark/conf/spark-env.sh 
for i in `cat /root/spark-ec2/slaves`;do scp /root/spark/conf/spark-env.sh 
$i:/root/spark/conf/spark-env.sh; done pssh -i -h /root/spark-ec2/slaves grep 
JAVA_HOME /root/spark/conf/spark-env.sh

Re: UpdateStateByKey : Partitioning and Shuffle

2016-01-05 Thread Tathagata Das
Both mapWithState and updateStateByKey use the HashPartitioner by default,
and hash the key in the key-value DStream on which the state operation is
applied. The new data and the state are partitioned by the exact same partitioner,
so that the same keys from the new data (from the input DStream) get shuffled
and colocated with the already partitioned state RDDs. So the new data is
brought to the corresponding old state on the same machine and then the
state mapping/updating function is applied. The state is not shuffled
every time; only the new data is shuffled in every batch.
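A small Scala sketch of making that partitioning explicit (the update function, value types and partition count are placeholders, and pairStream is assumed to be a DStream of key-value pairs):

import org.apache.spark.HashPartitioner

val updateFunc = (newValues: Seq[Long], state: Option[Long]) =>
  Some(state.getOrElse(0L) + newValues.sum)

// The same HashPartitioner is used for the state RDDs and for shuffling each new batch.
val stateStream = pairStream.updateStateByKey(updateFunc, new HashPartitioner(4))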




On Tue, Jan 5, 2016 at 5:21 PM, Soumitra Johri  wrote:

> Hi,
>
> I am relatively new to Spark and am using updateStateByKey() operation to
> maintain state in my Spark Streaming application. The input data is coming
> through a Kafka topic.
>
>1. I want to understand how are DStreams partitioned?
>2. How does the partitioning work with mapWithState() or
>updateStatebyKey() method?
>3. In updateStateByKey() does the old state and the new values against
>a given key processed on same node ?
>4. How frequent is the shuffle for updateStateByKey() method ?
>
> The state I have to maintaining contains ~ 10 keys and I want to avoid
> shuffle every time I update the state , any tips to do it ?
>
> Warm Regards
> Soumitra
>


Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Jeff Zhang
Thanks Mark, custom configuration file would be better for me. Changing
code will make it affect all the applications, this is too risky for me.



On Wed, Jan 6, 2016 at 10:50 AM, Mark Hamstra 
wrote:

> The other way to do it is to build a custom version of Spark where you
> have changed the value of DEFAULT_SCHEDULING_MODE -- and if you were
> paying close attention, I accidentally let it slip that that is what I've
> done.  I previously wrote "schedulingMode = DEFAULT_SCHEDULING_MODE --
> i.e. SchedulingMode.FAIR", but that should actually be SchedulingMode.FIFO
> if you haven't changed the code:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala#L65
>
> On Tue, Jan 5, 2016 at 5:29 PM, Jeff Zhang  wrote:
>
>> Right, I can override the root pool in configuration file, Thanks Mark.
>>
>> On Wed, Jan 6, 2016 at 8:45 AM, Mark Hamstra 
>> wrote:
>>
>>> Just configure <pool name="default"> with
>>> <schedulingMode>FAIR</schedulingMode> in fairscheduler.xml (or
>>> in spark.scheduler.allocation.file if you have over-riden the default name
>>> for the config file.)  `buildDefaultPool()` will only build the pool named
>>> "default" with the default properties (such as schedulingMode =
>>> DEFAULT_SCHEDULING_MODE -- i.e. SchedulingMode.FAIR) if that pool name is
>>> not already built (
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala#L90
>>> ).
>>>
>>>
>>> On Tue, Jan 5, 2016 at 4:15 PM, Jeff Zhang  wrote:
>>>
 Sorry, I don't make it clearly. What I want is the default pool is fair
 scheduling. But seems if I want to use fair scheduling now, I have to set
 spark.scheduler.pool explicitly.

 On Wed, Jan 6, 2016 at 2:03 AM, Mark Hamstra 
 wrote:

> I don't understand.  If you're using fair scheduling and don't set a
> pool, the default pool will be used.
>
> On Tue, Jan 5, 2016 at 1:57 AM, Jeff Zhang  wrote:
>
>>
>> It seems currently spark.scheduler.pool must be set as
>> localProperties (associate with thread). Any reason why
>> spark.scheduler.pool can not be used globally.  My scenario is that I 
>> want
>> my thriftserver started with fair scheduler as the default pool without
>> using set command to set the pool. Is there anyway to do that ? Or do I
>> miss anything here ?
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


-- 
Best Regards

Jeff Zhang


[Spark-SQL] Custom aggregate function for GrouppedData

2016-01-05 Thread Abhishek Gayakwad
Hello Hivemind,

Referring to this thread -
https://forums.databricks.com/questions/956/how-do-i-group-my-dataset-by-a-key-or-combination.html
- I have learnt that we cannot do much with grouped data apart from using the
existing aggregate functions. That post was written in May 2015, and I don't
know if things have changed since then. I am using version 1.4 of Spark.

What I am trying to achieve is something very similar to collect_set in Hive
(actually unique, ordered, concatenated values), e.g.

1,2
1,3
2,4
2,5
2,4

to
1, "2,3"
2, "4,5"

Currently I am achieving this by converting the dataframe to an RDD, doing the
required operations, and converting it back to a dataframe as shown below.

public class AvailableSizes implements Serializable {

    public DataFrame calculate(SQLContext ssc, DataFrame salesDataFrame) {
        final JavaRDD<Row> rowJavaRDD = salesDataFrame.toJavaRDD();

        JavaPairRDD<String, Row> pairs = rowJavaRDD.mapToPair(
            (PairFunction<Row, String, Row>) row -> {
                final Object[] objects = {row.getAs(0), row.getAs(1), row.getAs(3)};
                return new Tuple2<>(row.getAs(SalesColumns.STYLE.name()),
                    new GenericRowWithSchema(objects, SalesColumns.getOutputSchema()));
            });

        JavaPairRDD<String, Row> withSizeList = pairs.reduceByKey(new Function2<Row, Row, Row>() {
            @Override
            public Row call(Row aRow, Row bRow) {
                final String uniqueCommaSeparatedSizes = uniqueSizes(aRow, bRow);
                final Object[] objects = {aRow.getAs(0), aRow.getAs(1), uniqueCommaSeparatedSizes};
                return new GenericRowWithSchema(objects, SalesColumns.getOutputSchema());
            }

            private String uniqueSizes(Row aRow, Row bRow) {
                final SortedSet<String> allSizes = new TreeSet<>();
                final List<String> aSizes = Arrays.asList(((String)
                    aRow.getAs(String.valueOf(SalesColumns.SIZE))).split(","));
                final List<String> bSizes = Arrays.asList(((String)
                    bRow.getAs(String.valueOf(SalesColumns.SIZE))).split(","));
                allSizes.addAll(aSizes);
                allSizes.addAll(bSizes);
                return csvFormat(allSizes);
            }
        });

        final JavaRDD<Row> values = withSizeList.values();

        return ssc.createDataFrame(values, SalesColumns.getOutputSchema());
    }

    public String csvFormat(Collection<?> collection) {
        return collection.stream().map(Object::toString).collect(Collectors.joining(","));
    }
}

Please suggest if there is a better way of doing this.

Regards,
Abhishek
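For comparison, a compact Scala sketch of the same grouping done with aggregateByKey (the column names "style" and "size" are assumed from the enum in the code above; this is an illustration rather than a drop-in replacement):

import scala.collection.immutable.SortedSet

val pairs = salesDataFrame.rdd.map(r => (r.getAs[String]("style"), r.getAs[String]("size")))

val sizesPerStyle = pairs.aggregateByKey(SortedSet.empty[String])(
  (set, sizes) => set ++ sizes.split(","),   // fold one row's comma-separated sizes into the set
  (a, b) => a ++ b)                          // merge partial sets from different partitions

val formatted = sizesPerStyle.mapValues(_.mkString(","))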


Re: aggregateByKey vs combineByKey

2016-01-05 Thread Ted Yu
Looking at PairRDDFunctions.scala :

  def aggregateByKey[U: ClassTag](zeroValue: U, partitioner:
Partitioner)(seqOp: (U, V) => U,
  combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
...
combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
  cleanedSeqOp, combOp, partitioner)

I think the two operations should have similar performance.

Cheers

On Tue, Jan 5, 2016 at 1:13 PM, Marco Mistroni  wrote:

> Hi all
>  i have the following dataSet
> kv = [(2,Hi), (1,i), (2,am), (1,a), (4,test), (6,s tring)]
>
> It's a simple list of tuples containing (word_length, word)
>
> What i wanted to do was to group the result by key in order to have a
> result in the form
>
> [(word_length_1, [word1, word2, word3], word_length_2, [word4, word5,
> word6])
>
> so i browsed spark API and was able to get the result i wanted using two
> different
> functions
> .
>
> scala> kv.combineByKey(List(_), (x:List[String], y:String) => y :: x,
> (x:List[St
>
> ring], y:List[String]) => x ::: y).collect()
>
> res86: Array[(Int, List[String])] = Array((1,List(i, a)), (2,List(Hi,
> am)), (4,L
> ist(test)), (6,List(string)))
>
> and
>
> scala>
>
> scala> kv.aggregateByKey(List[String]())((acc, item) => item :: acc,
>
>  |(acc1, acc2) => acc1 ::: acc2).collect()
>
>
>
>
>
>
>
> res87: Array[(Int, List[String])] = Array((1,List(i, a)), (2,List(Hi,
> am)), (4,L
> ist(test)), (6,List(string)))
>
> Now, question is: any advantages of using one instead of the others?
> Am i somehow misusing the API for what i want to do?
>
> kind regards
>  marco
>
>
>
>
>
>
>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
As I pointed out in my earlier email, RHEL will support Python 2.6 until
2020. So I'm assuming these large companies will have the option of riding
out Python 2.6 until then.

Are we seriously saying that Spark should likewise support Python 2.6 for
the next several years? Even though the core Python devs stopped supporting
it in 2013?

If that's not what we're suggesting, then when, roughly, can we drop
support? What are the criteria?

I understand the practical concern here. If companies are stuck using 2.6,
it doesn't matter to them that it is deprecated. But balancing that concern
against the maintenance burden on this project, I would say that "upgrade
to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to take.
There are many tiny annoyances one has to put up with to support 2.6.

I suppose if our main PySpark contributors are fine putting up with those
annoyances, then maybe we don't need to drop support just yet...

Nick
On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:

> Unfortunately, Koert is right.
>
> I've been in a couple of projects using Spark (banking industry) where
> CentOS + Python 2.6 is the toolbox available.
>
> That said, I believe it should not be a concern for Spark. Python 2.6 is
> old and busted, which is totally opposite to the Spark philosophy IMO.
>
>
> El 5 ene 2016, a las 20:07, Koert Kuipers  escribió:
>
> rhel/centos 6 ships with python 2.6, doesnt it?
>
> if so, i still know plenty of large companies where python 2.6 is the only
> option. asking them for python 2.7 is not going to work
>
> so i think its a bad idea
>
> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland  > wrote:
>
>> I don't see a reason Spark 2.0 would need to support Python 2.6. At this
>> point, Python 3 should be the default that is encouraged.
>> Most organizations acknowledge the 2.7 is common, but lagging behind the
>> version they should theoretically use. Dropping python 2.6
>> support sounds very reasonable to me.
>>
>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Red Hat supports Python 2.6 on RHEL 5 until 2020,
>>> but otherwise yes, Python 2.6 is ancient history and the core Python
>>> developers stopped supporting it in 2013. RHEL 5 is not a good enough
>>> reason to continue support for Python 2.6 IMO.
>>>
>>> We should aim to support Python 2.7 and Python 3.3+ (which I believe we
>>> currently do).
>>>
>>> Nick
>>>
>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang 
>>> wrote:
>>>
 plus 1,

 we are currently using python 2.7.2 in production environment.





 On 2016-01-05 18:11:45, "Meethu Mathew"  wrote:

 +1
 We use Python 2.7

 Regards,

 Meethu Mathew

 On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin 
 wrote:

> Does anybody here care about us dropping support for Python 2.6 in
> Spark 2.0?
>
> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
> parsing) when compared with Python 2.7. Some libraries that Spark depend 
> on
> stopped supporting 2.6. We can still convince the library maintainers to
> support 2.6, but it will be extra work. I'm curious if anybody still uses
> Python 2.6 to run Spark.
>
> Thanks.
>
>
>

>>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
yeah, the practical concern is that we have no control over java or python
version on large company clusters. our current reality for the vast
majority of them is java 7 and python 2.6, no matter how outdated that is.

i dont like it either, but i cannot change it.

we currently don't use pyspark so i have no stake in this, but if we did i
can assure you we would not upgrade to spark 2.x if python 2.6 was dropped.
no point in developing something that doesnt run for majority of customers.

On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas  wrote:

> As I pointed out in my earlier email, RHEL will support Python 2.6 until
> 2020. So I'm assuming these large companies will have the option of riding
> out Python 2.6 until then.
>
> Are we seriously saying that Spark should likewise support Python 2.6 for
> the next several years? Even though the core Python devs stopped supporting
> it in 2013?
>
> If that's not what we're suggesting, then when, roughly, can we drop
> support? What are the criteria?
>
> I understand the practical concern here. If companies are stuck using 2.6,
> it doesn't matter to them that it is deprecated. But balancing that concern
> against the maintenance burden on this project, I would say that "upgrade
> to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to take.
> There are many tiny annoyances one has to put up with to support 2.6.
>
> I suppose if our main PySpark contributors are fine putting up with those
> annoyances, then maybe we don't need to drop support just yet...
>
> Nick
> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:
>
>> Unfortunately, Koert is right.
>>
>> I've been in a couple of projects using Spark (banking industry) where
>> CentOS + Python 2.6 is the toolbox available.
>>
>> That said, I believe it should not be a concern for Spark. Python 2.6 is
>> old and busted, which is totally opposite to the Spark philosophy IMO.
>>
>>
>> On 5 Jan 2016, at 20:07, Koert Kuipers  wrote:
>>
>> rhel/centos 6 ships with python 2.6, doesnt it?
>>
>> if so, i still know plenty of large companies where python 2.6 is the
>> only option. asking them for python 2.7 is not going to work
>>
>> so i think its a bad idea
>>
>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
>> juliet.hougl...@gmail.com> wrote:
>>
>>> I don't see a reason Spark 2.0 would need to support Python 2.6. At this
>>> point, Python 3 should be the default that is encouraged.
>>> Most organizations acknowledge the 2.7 is common, but lagging behind the
>>> version they should theoretically use. Dropping python 2.6
>>> support sounds very reasonable to me.
>>>
>>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 +1

 Red Hat supports Python 2.6 on RHEL 5 until 2020
 ,
 but otherwise yes, Python 2.6 is ancient history and the core Python
 developers stopped supporting it in 2013. RHEL 5 is not a good enough
 reason to continue support for Python 2.6 IMO.

 We should aim to support Python 2.7 and Python 3.3+ (which I believe we
 currently do).

 Nick

 On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang 
 wrote:

> plus 1,
>
> we are currently using python 2.7.2 in production environment.
>
>
>
>
>
> On 2016-01-05 18:11:45, "Meethu Mathew"  wrote:
>
> +1
> We use Python 2.7
>
> Regards,
>
> Meethu Mathew
>
> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin 
> wrote:
>
>> Does anybody here care about us dropping support for Python 2.6 in
>> Spark 2.0?
>>
>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>> parsing) when compared with Python 2.7. Some libraries that Spark depend 
>> on
>> stopped supporting 2.6. We can still convince the library maintainers to
>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>> Python 2.6 to run Spark.
>>
>> Thanks.
>>
>>
>>
>
>>>
>>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
If users are able to install Spark 2.0 on their RHEL clusters, then I
imagine that they're also capable of installing a standalone Python
alongside that Spark version (without changing Python systemwide). For
instance, Anaconda/Miniconda make it really easy to install Python
2.7.x/3.x without impacting / changing the system Python and doesn't
require any special permissions to install (you don't need root / sudo
access). Does this address the Python versioning concerns for RHEL users?

On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers  wrote:

> yeah, the practical concern is that we have no control over java or python
> version on large company clusters. our current reality for the vast
> majority of them is java 7 and python 2.6, no matter how outdated that is.
>
> i dont like it either, but i cannot change it.
>
> we currently don't use pyspark so i have no stake in this, but if we did i
> can assure you we would not upgrade to spark 2.x if python 2.6 was dropped.
> no point in developing something that doesnt run for majority of customers.
>
> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> As I pointed out in my earlier email, RHEL will support Python 2.6 until
>> 2020. So I'm assuming these large companies will have the option of riding
>> out Python 2.6 until then.
>>
>> Are we seriously saying that Spark should likewise support Python 2.6 for
>> the next several years? Even though the core Python devs stopped supporting
>> it in 2013?
>>
>> If that's not what we're suggesting, then when, roughly, can we drop
>> support? What are the criteria?
>>
>> I understand the practical concern here. If companies are stuck using
>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>> concern against the maintenance burden on this project, I would say that
>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
>> take. There are many tiny annoyances one has to put up with to support 2.6.
>>
>> I suppose if our main PySpark contributors are fine putting up with those
>> annoyances, then maybe we don't need to drop support just yet...
>>
>> Nick
>> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:
>>
>>> Unfortunately, Koert is right.
>>>
>>> I've been in a couple of projects using Spark (banking industry) where
>>> CentOS + Python 2.6 is the toolbox available.
>>>
>>> That said, I believe it should not be a concern for Spark. Python 2.6 is
>>> old and busted, which is totally opposite to the Spark philosophy IMO.
>>>
>>>
>>> On 5 Jan 2016, at 20:07, Koert Kuipers  wrote:
>>>
>>> rhel/centos 6 ships with python 2.6, doesnt it?
>>>
>>> if so, i still know plenty of large companies where python 2.6 is the
>>> only option. asking them for python 2.7 is not going to work
>>>
>>> so i think its a bad idea
>>>
>>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
>>> juliet.hougl...@gmail.com> wrote:
>>>
 I don't see a reason Spark 2.0 would need to support Python 2.6. At
 this point, Python 3 should be the default that is encouraged.
 Most organizations acknowledge the 2.7 is common, but lagging behind
 the version they should theoretically use. Dropping python 2.6
 support sounds very reasonable to me.

 On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> +1
>
> Red Hat supports Python 2.6 on RHEL 5 until 2020
> ,
> but otherwise yes, Python 2.6 is ancient history and the core Python
> developers stopped supporting it in 2013. RHEL 5 is not a good enough
> reason to continue support for Python 2.6 IMO.
>
> We should aim to support Python 2.7 and Python 3.3+ (which I believe
> we currently do).
>
> Nick
>
> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang 
> wrote:
>
>> plus 1,
>>
>> we are currently using python 2.7.2 in production environment.
>>
>>
>>
>>
>>
>> On 2016-01-05 18:11:45, "Meethu Mathew"  wrote:
>>
>> +1
>> We use Python 2.7
>>
>> Regards,
>>
>> Meethu Mathew
>>
>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin 
>> wrote:
>>
>>> Does anybody here care about us dropping support for Python 2.6 in
>>> Spark 2.0?
>>>
>>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>>> parsing) when compared with Python 2.7. Some libraries that Spark 
>>> depend on
>>> stopped supporting 2.6. We can still convince the library maintainers to
>>> support 2.6, but it will be extra work. I'm curious if anybody still 
>>> uses
>>> Python 2.6 to run Spark.
>>>
>>> Thanks.
>>>
>>>
>>>
>>

>>>
>


How to concat few rows into a new column in dataframe

2016-01-05 Thread Gavin Yue
Hey,

For example, a table df with two columns
id  name
1   abc
1   bdf
2   ab
2   cd

I want to group by the id and concat the string into array of string. like
this

id
1 [abc,bdf]
2 [ab, cd]

How could I achieve this in a dataframe? I am stuck at df.groupBy("id"). ???

Thanks
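
A minimal sketch of one way to do this, assuming Spark 1.6, where collect_list
is exposed in org.apache.spark.sql.functions (on 1.x it is backed by the Hive
UDAF, so sqlContext needs to be a HiveContext -- which the spark-shell one
usually is). The column names mirror the example above:

import org.apache.spark.sql.functions.{col, collect_list}

// Rebuild the example table from the post.
val df = sqlContext.createDataFrame(Seq(
  (1, "abc"), (1, "bdf"), (2, "ab"), (2, "cd")
)).toDF("id", "name")

// collect_list gathers the grouped values into an array-of-string column.
val grouped = df.groupBy("id").agg(collect_list(col("name")).as("names"))
grouped.show()   // 1 -> [abc, bdf], 2 -> [ab, cd]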


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
i do not think so.

does the python 2.7 need to be installed on all slaves? if so, we do not
have direct access to those.

also, spark is easy for us to ship with our software since its apache 2
licensed, and it only needs to be present on the machine that launches the
app (thanks to yarn).
even if python 2.7 was needed only on this one machine that launches the
app we can not ship it with our software because its gpl licensed, so the
client would have to download it and install it themselves, and this would
mean its an independent install which has to be audited and approved and
now you are in for a lot of fun. basically it will never happen.


On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen  wrote:

> If users are able to install Spark 2.0 on their RHEL clusters, then I
> imagine that they're also capable of installing a standalone Python
> alongside that Spark version (without changing Python systemwide). For
> instance, Anaconda/Miniconda make it really easy to install Python
> 2.7.x/3.x without impacting / changing the system Python and doesn't
> require any special permissions to install (you don't need root / sudo
> access). Does this address the Python versioning concerns for RHEL users?
>
> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers  wrote:
>
>> yeah, the practical concern is that we have no control over java or
>> python version on large company clusters. our current reality for the vast
>> majority of them is java 7 and python 2.6, no matter how outdated that is.
>>
>> i dont like it either, but i cannot change it.
>>
>> we currently don't use pyspark so i have no stake in this, but if we did
>> i can assure you we would not upgrade to spark 2.x if python 2.6 was
>> dropped. no point in developing something that doesnt run for majority of
>> customers.
>>
>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> As I pointed out in my earlier email, RHEL will support Python 2.6 until
>>> 2020. So I'm assuming these large companies will have the option of riding
>>> out Python 2.6 until then.
>>>
>>> Are we seriously saying that Spark should likewise support Python 2.6
>>> for the next several years? Even though the core Python devs stopped
>>> supporting it in 2013?
>>>
>>> If that's not what we're suggesting, then when, roughly, can we drop
>>> support? What are the criteria?
>>>
>>> I understand the practical concern here. If companies are stuck using
>>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>>> concern against the maintenance burden on this project, I would say that
>>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
>>> take. There are many tiny annoyances one has to put up with to support 2.6.
>>>
>>> I suppose if our main PySpark contributors are fine putting up with
>>> those annoyances, then maybe we don't need to drop support just yet...
>>>
>>> Nick
>>> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:
>>>
 Unfortunately, Koert is right.

 I've been in a couple of projects using Spark (banking industry) where
 CentOS + Python 2.6 is the toolbox available.

 That said, I believe it should not be a concern for Spark. Python 2.6
 is old and busted, which is totally opposite to the Spark philosophy IMO.


 On 5 Jan 2016, at 20:07, Koert Kuipers  wrote:

 rhel/centos 6 ships with python 2.6, doesnt it?

 if so, i still know plenty of large companies where python 2.6 is the
 only option. asking them for python 2.7 is not going to work

 so i think its a bad idea

 On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
 juliet.hougl...@gmail.com> wrote:

> I don't see a reason Spark 2.0 would need to support Python 2.6. At
> this point, Python 3 should be the default that is encouraged.
> Most organizations acknowledge the 2.7 is common, but lagging behind
> the version they should theoretically use. Dropping python 2.6
> support sounds very reasonable to me.
>
> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> +1
>>
>> Red Hat supports Python 2.6 on RHEL 5 until 2020
>> ,
>> but otherwise yes, Python 2.6 is ancient history and the core Python
>> developers stopped supporting it in 2013. RHEL 5 is not a good enough
>> reason to continue support for Python 2.6 IMO.
>>
>> We should aim to support Python 2.7 and Python 3.3+ (which I believe
>> we currently do).
>>
>> Nick
>>
>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang 
>> wrote:
>>
>>> plus 1,
>>>
>>> we are currently using python 2.7.2 in production environment.
>>>
>>>
>>>
>>>
>>>
>>> 在 2016-01-05 

Re: Spark SQL dataframes explode /lateral view help

2016-01-05 Thread Deenar Toraskar
val sparkConf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("Dataframe Test")

val sc = new SparkContext(sparkConf)
val sql = new SQLContext(sc)

val dataframe = sql.read.json("orders.json")

val expanded = dataframe
  .explode[::[Long], Long]("items", "item1")(row => row)
  .explode[::[Long], Long]("items", "item2")(row => row)

val grouped = expanded
  .where(expanded("item1") !== expanded("item2"))
  .groupBy("item1", "item2")
  .count()

val recs = grouped
  .groupBy("item1")

I found another example above, but I cant seem to figure out what this does?

val expanded = dataframe
  .explode[::[Long], Long]("items", "item1")(row => row)
  .explode[::[Long], Long]("items", "item2")(row => row)



On 5 January 2016 at 20:00, Deenar Toraskar 
wrote:

> Hi All
>
> I have the following spark sql query and would like to use convert this to
> use the dataframes api (spark 1.6). The eee, eep and pfep are all maps of
> (int -> float)
>
>
> select e.counterparty, epe, mpfe, eepe, noOfMonthseep, teee as
> effectiveExpectedExposure, teep as expectedExposure , tpfep as pfe
> |from exposureMeasuresCpty e
>   |  lateral view explode(eee) dummy1 as noOfMonthseee, teee
>   |  lateral view explode(eep) dummy2 as noOfMonthseep, teep
>   |  lateral view explode(pfep) dummy3 as noOfMonthspfep, tpfep
>   |where e.counterparty = '$cpty' and noOfMonthseee = noOfMonthseep and
> noOfMonthseee = noOfMonthspfep
>   |order by noOfMonthseep""".stripMargin
>
> Any guidance or samples would be appreciated. I have seen code snippets
> that handle arrays, but havent come across how to handle maps
>
> case class Book(title: String, words: String)
>val df: RDD[Book]
>
>case class Word(word: String)
>val allWords = df.explode('words) {
>  case Row(words: String) => words.split(" ").map(Word(_))
>}
>
>val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))
>
>
> Regards
> Deenar
>
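
As far as I can tell, the double-explode snippet above simply explodes the
items array twice (into item1 and item2), producing every pair of items within
each row; filtering item1 !== item2 and counting then gives co-occurrence
counts. For map columns such as eee/eep/pfep, the same Row-based explode
overload can be used, turning each (key, value) entry into extra columns. A
rough sketch, assuming exposures is the DataFrame behind exposureMeasuresCpty
and eee is a Map[Int, Float] column:

import org.apache.spark.sql.Row

case class MonthValue(noOfMonthseee: Int, teee: Float)

// Appends noOfMonthseee and teee columns, one output row per map entry --
// the DataFrame equivalent of "lateral view explode(eee)".
val exploded = exposures.explode(exposures("eee")) {
  case Row(eee: scala.collection.Map[Int, Float] @unchecked) =>
    eee.map { case (m, v) => MonthValue(m, v) }
}
// Repeating the pattern for eep and pfep and then filtering on equal month
// keys reproduces the three lateral views of the SQL version.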


Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Umesh Kacha
Hi, DataFrame's coalesce does not have the boolean (shuffle) option; that is
only available on RDDs, I believe.

sourceFrame.coalesce(1,true) //gives compilation error



On Wed, Jan 6, 2016 at 1:38 AM, Alexander Pivovarov 
wrote:

> try coalesce(1, true).
>
> On Tue, Jan 5, 2016 at 11:58 AM, unk1102  wrote:
>
>> hi I am trying to save many partitions of Dataframe into one CSV file and
>> it
>> take forever for large data sets of around 5-6 GB.
>>
>>
>> sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("gzip").save("/path/hadoop")
>>
>> For small data above code works well but for large data it hangs forever
>> does not move on because of only one partitions has to shuffle data of GBs
>> please help me
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/coalesce-1-saveAsTextfile-takes-forever-tp25886.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
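
For what it's worth, the usual workaround is repartition(1) on the DataFrame:
unlike coalesce it always shuffles, so the upstream stages keep their full
parallelism and only the final write runs as a single task. A sketch against
the code from the thread (the "codec" option name is taken from the spark-csv
docs -- verify it against your spark-csv version):

sourceFrame
  .repartition(1)   // shuffle into one partition; upstream work stays parallel
  .write
  .format("com.databricks.spark.csv")
  .option("codec", "gzip")
  .save("/path/hadoop")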


RE: Spark on Apache Ingnite?

2016-01-05 Thread nate
We started playing with Ignite backing Hadoop, Hive and Spark services, and are
looking to move to it as our default for deployments going forward. It's still
early, but so far it's been pretty nice, and we're excited about the flexibility
it will provide for our particular use cases.

Would say in general its worth looking into if your data workloads are:

a) mix of read/write, or heavy write at times
b) want write/read access to data from services/apps outside of your spark
workloads (old Hadoop jobs, custom apps, etc)
c) have strings of spark jobs that could benefit from caching your data
across them (think similar usage to tachyon)
d) you have sparksql queries that could benefit from indexing and mutability
(see pt (a) about mix read/write)

If your data is read exclusive and very batch oriented, and your workloads
are strictly spark based, benefits will be less and ignite would probably
act as more of a tachyon replacement as many of the other features outside
of RDD caching wont be leveraged.


-Original Message-
From: unk1102 [mailto:umesh.ka...@gmail.com] 
Sent: Tuesday, January 5, 2016 10:15 AM
To: user@spark.apache.org
Subject: Spark on Apache Ingnite?

Hi has anybody tried and had success with Spark on Apache Ignite seems
promising? https://ignite.apache.org/



--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Apache-Ingnite-
tp25884.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-05 Thread Michael Armbrust
>
> I am trying to implement  org.apache.spark.ml.Transformer interface in
> Java 8.
>
My understanding is the pseudocode for transformers is something like
>
> @Override
>
> public DataFrame transform(DataFrame df) {
>
> 1. Select the input column
>
> 2. Create a new column
>
> 3. Append the new column to the df argument and return
>
>}
>

The following line can be used inside of the transform function to return a
Dataframe that has been augmented with a new column using the stem lambda
function (defined as a UDF below).

return df.withColumn("filteredInput", expr("stem(rawInput)"));

This is producing a new column called filteredInput (that is appended to
whatever columns are already there) by passing the column rawInput to your
arbitrary lambda function.


> Based on my experience the current DataFrame api is very limited. You can
> not apply a complicated lambda function. As a work around I convert the
> data frame to a JavaRDD, apply my complicated lambda, and then convert the
> resulting RDD back to a Data Frame.
>

This is exactly what this code is doing.  You are defining an arbitrary
lambda function as a UDF.  The difference here, when compared to a JavaRDD
map, is that you can use this UDF to append columns without having to
manually append the new data to some existing object.

sqlContext.udf().register("stem", new UDF1<String, String>() {
  @Override
  public String call(String str) {
    return str; // TODO: stemming code here
  }
}, DataTypes.StringType);

Now I select the “new column” from the Data Frame and try to call
> df.withColumn().
>
>
> I can try an implement this as a UDF. How ever I need to use several 3rd
> party jars. Any idea how insure the workers will have the required jar
> files? If I was submitting a normal java app I would create an uber jar
> will this work with UDFs?
>

Yeah, UDFs are run the same way as your RDD lambda functions.


Re: Negative Number of Active Tasks in Spark UI

2016-01-05 Thread Ted Yu
Which version of Spark do you use ?

This might be related:
https://issues.apache.org/jira/browse/SPARK-8560

Do you use dynamic allocation ?

Cheers

> On Jan 4, 2016, at 10:05 PM, Prasad Ravilla  wrote:
> 
> I am seeing negative active tasks in the Spark UI.
> 
> Is anyone seeing this?
> How is this possible?
> 
> Thanks,
> Prasad.
> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



sparkR ORC support.

2016-01-05 Thread Sandeep Khurana
Hello

I need to read ORC files in HDFS from R using Spark. I am not able to find
a package to do that.

Can anyone help with documentation or example for this purpose?

-- 
Architect
Infoworks.io
http://Infoworks.io


Re: finding distinct count using dataframe

2016-01-05 Thread Arunkumar Pillai
Thanks Yanbo,

Thanks for the help. But I'm not able to find the countDistinct or
approxCountDistinct functions. Are these functions part of DataFrame or of some
other package?

On Tue, Jan 5, 2016 at 3:24 PM, Yanbo Liang  wrote:

> Hi Arunkumar,
>
> You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or
> approxCountDistinct for a approximate result.
>
> 2016-01-05 17:11 GMT+08:00 Arunkumar Pillai :
>
>> Hi
>>
>> Is there any   functions to find distinct count of all the variables in
>> dataframe.
>>
>> val sc = new SparkContext(conf) // spark context
>> val options = Map("header" -> "true", "delimiter" -> delimiter, 
>> "inferSchema" -> "true")
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc) // sql context
>> val datasetDF = 
>> sqlContext.read.format("com.databricks.spark.csv").options(options).load(inputFile)
>>
>>
>> we are able to get the schema, variable data type. is there any method to 
>> get the distinct count ?
>>
>>
>>
>> --
>> Thanks and Regards
>> Arun
>>
>
>


-- 
Thanks and Regards
Arun
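
For reference, both functions live in org.apache.spark.sql.functions rather
than on DataFrame itself. A minimal sketch against the datasetDF above (the
column names are placeholders):

import org.apache.spark.sql.functions.{approxCountDistinct, col, countDistinct}

// Exact and approximate distinct counts for two (hypothetical) columns.
val counts = datasetDF.agg(
  countDistinct(col("col1")).as("col1_distinct"),
  approxCountDistinct(col("col2")).as("col2_distinct_approx")
)
counts.show()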


RE: Spark Streaming + Kafka + scala job message read issue

2016-01-05 Thread vivek.meghanathan
Hello All,

After investigating further using a test program, we were able to read the 
kafka input messages using spark streaming.

Once we add a particular line which performs map and reduce and groupByKey
(all written in a single line), we no longer see the input message details in
the logs. We have increased the batch interval to 5 seconds and removed the
numTasks setting (it was defined as 10). Once we made this change the Kafka
messages started to get processed, but it takes a long time to process them.

This works fine on our local lab server, but the problem occurs on the Google
Compute Engine server. The local lab server is low spec (8 CPUs with 8GB RAM)
while the cloud server is a high-memory one (30GB RAM and 8 CPUs). As far as I
could see, execution happens much faster on the Google platform, but somehow
the job processing gets messed up.

Any suggestions?


Regards,
Vivek M



From: Vivek Meghanathan (WT01 - NEP)
Sent: 27 December 2015 11:08
To: Bryan 
Cc: Vivek Meghanathan (WT01 - NEP) ; 
duc.was.h...@gmail.com; user@spark.apache.org
Subject: Re: Spark Streaming + Kafka + scala job message read issue


Hi Bryan,
Yes we are using only 1 thread per topic as we have only one Kafka server with 
1 partition.
What kind of logs will tell us what offset spark stream is reading from Kafka 
or is it resetting something without reading?

Regards
Vivek


Sent using CloudMagic 
Email
On Sun, Dec 27, 2015 at 12:03 am, Bryan 
> wrote:

Vivek,

Where you're using numThreads - look at the documentation for createStream. I 
believe that number should be the number of partitions to consume.

Sent from Outlook Mail for 
Windows 10 phone


From: vivek.meghanat...@wipro.com
Sent: Friday, December 25, 2015 11:39 PM
To: bryan.jeff...@gmail.com
Cc: duc.was.h...@gmail.com; 
vivek.meghanat...@wipro.com; 
user@spark.apache.org
Subject: Re: Spark Streaming + Kafka + scala job message read issue


Hi Brian,PhuDuc,

All 8 jobs are consuming 8 different IN topics. 8 different Scala jobs running 
each topic map mentioned below has only 1 thread number mentioned. In this case 
group should not be a problem right.

Here is the complete flow, spring MVC sends in messages to Kafka , spark 
streaming reading that and sends message back to Kafka, some cases they will 
update data to Cassandra only. Spring the response messages.
I could see the message is always reaching Kafka (checked through the console 
consumer).

Regards
Vivek


Sent using CloudMagic 
Email
On Sat, Dec 26, 2015 at 2:42 am, Bryan 
> wrote:

Agreed. I did not see that they were using the same group name.

Sent from Outlook Mail for 
Windows 10 phone


From: PhuDuc Nguyen
Sent: Friday, December 25, 2015 3:35 PM
To: vivek.meghanat...@wipro.com
Cc: user@spark.apache.org
Subject: Re: Spark Streaming + Kafka + scala job message read issue

Vivek,

Did you say you have 8 spark jobs that are consuming from the same topic and 
all jobs are using the same consumer group name? If so, each job would get a 
subset of messages from that kafka topic, ie each job would get 1 out of 8 
messages from that topic. Is that your intent?

regards,
Duc






On Thu, Dec 24, 2015 at 7:20 AM, 
> wrote:
We are using the older receiver based approach, the number of partitions is 1 
(we have a single node kafka) and we use single thread per topic still we have 
the problem. Please see the API we use. All 8 spark jobs use same group name - 
is that a problem?

val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap  - Number of 
threads used here is 1
val searches = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(line 
=> parse(line._2).extract[Search])


Regards,
Vivek M
From: Bryan [mailto:bryan.jeff...@gmail.com]
Sent: 24 December 2015 17:20
To: Vivek Meghanathan (WT01 - NEP) 
>; 
user@spark.apache.org
Subject: RE: Spark Streaming + Kafka + scala job message read issue

Are you using a direct stream consumer, or the older receiver based consumer? 
If the latter, do the number of partitions you've specified for your topic 
match the number of partitions in the topic on Kafka?

That would be an possible cause - as you might receive all data from a 

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Andy Davidson
Hi Unk1102

I also had trouble when I used coalesce(); repartition() worked much better.
Keep in mind that if you have a large number of partitions you are probably
going to have high communication costs.

Also, my code works a lot better on 1.6.0. DataFrame memory was not being
spilled in 1.5.2; in 1.6.0 unpersist() actually frees up memory.

Another strange thing I noticed in 1.5.1 was that I had thousands of
partitions. Many of them were empty. Having lots of empty partitions really
slowed things down.

Andy

From:  unk1102 
Date:  Tuesday, January 5, 2016 at 11:58 AM
To:  "user @spark" 
Subject:  coalesce(1).saveAsTextfile() takes forever?

> hi I am trying to save many partitions of Dataframe into one CSV file and it
> take forever for large data sets of around 5-6 GB.
> 
> sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("gzi
> p").save("/path/hadoop")
> 
> For small data above code works well but for large data it hangs forever
> does not move on because of only one partitions has to shuffle data of GBs
> please help me
> 
> 
> 
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/coalesce-1-saveAsTextfile-
> takes-forever-tp25886.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 




aggregateByKey vs combineByKey

2016-01-05 Thread Marco Mistroni
Hi all
 i have the following dataSet
kv = [(2,Hi), (1,i), (2,am), (1,a), (4,test), (6,string)]

It's a simple list of tuples containing (word_length, word)

What i wanted to do was to group the result by key in order to have a
result in the form

[(word_length_1, [word1, word2, word3]), (word_length_2, [word4, word5, word6])]

so i browsed spark API and was able to get the result i wanted using two
different functions.

scala> kv.combineByKey(List(_), (x: List[String], y: String) => y :: x,
     |   (x: List[String], y: List[String]) => x ::: y).collect()
res86: Array[(Int, List[String])] = Array((1,List(i, a)), (2,List(Hi, am)), (4,List(test)), (6,List(string)))

and

scala> kv.aggregateByKey(List[String]())((acc, item) => item :: acc,
     |   (acc1, acc2) => acc1 ::: acc2).collect()
res87: Array[(Int, List[String])] = Array((1,List(i, a)), (2,List(Hi, am)), (4,List(test)), (6,List(string)))

Now, question is: any advantages of using one instead of the others?
Am i somehow misusing the API for what i want to do?

kind regards
 marco
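
For what it's worth, aggregateByKey is implemented on top of the same
combine-by-key machinery, so the two calls above do essentially the same work;
aggregateByKey just takes a zero value instead of a createCombiner function,
which many people find easier to read. Since concatenating lists does not
actually shrink the data during the map-side combine, a plain groupByKey is
also a reasonable (and simpler) way to get the same result -- a one-line sketch
using the kv RDD above:

kv.groupByKey().mapValues(_.toList).collect()
// yields the same Array[(Int, List[String])] as res86/res87 above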


Re: Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Olivier Girardot
I'll do, but if you want my two cents, creating a dedicated "optimised"
encoder for Avro would be great (especially if it's possible to do better
than plain AvroKeyValueOutputFormat with saveAsNewAPIHadoopFile :) )

Thanks for your time Michael, and happy new year :-)

Regards,

Olivier.

2016-01-05 19:01 GMT+01:00 Michael Armbrust :

> You could try with the `Encoders.bean` method.  It detects classes that
> have getters and setters.  Please report back!
>
> On Tue, Jan 5, 2016 at 9:45 AM, Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
>> Hi everyone,
>> considering the new Datasets API, will there be Encoders defined for
>> reading and writing Avro files ? Will it be possible to use already
>> generated Avro classes ?
>>
>> Regards,
>>
>> --
>> *Olivier Girardot*
>>
>
>


-- 
*Olivier Girardot* | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
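
For anyone who wants to try Michael's suggestion, a rough sketch of what that
test could look like. MyAvroRecord stands in for a generated Avro specific
record, and records for a Seq of such objects; whether the bean encoder copes
with all Avro-generated getters/setters is exactly what needs verifying:

import org.apache.spark.sql.{Dataset, Encoders}

// Encoders.bean builds an encoder from the class's getters/setters (Spark 1.6+).
val avroEncoder = Encoders.bean(classOf[MyAvroRecord])

// sqlContext is the usual SQLContext; the encoder is passed explicitly.
val ds: Dataset[MyAvroRecord] = sqlContext.createDataset(records)(avroEncoder)
ds.show()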


sortBy transformation shows as a job

2016-01-05 Thread Soumitra Kumar
Fellows,
I have some simple code:

sc.parallelize(Array(1, 4, 3, 2), 2).sortBy(i => i).foreach(println)

This results in 2 jobs (sortBy, foreach) in Spark's application master UI.
I thought there was a one-to-one relationship between RDD actions and jobs.
Here, the only action is foreach, so there should be only one job.
Please help me understand.
Thanks,
-Soumitra

Re: Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Michael Armbrust
On Tue, Jan 5, 2016 at 1:31 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> I'll do, but if you want my two cents, creating a dedicated "optimised"
> encoder for Avro would be great (especially if it's possible to do better
> than plain AvroKeyValueOutputFormat with saveAsNewAPIHadoopFile :) )
>

+1 and if not inside of spark, inside of something like
https://github.com/databricks/spark-avro once we open up the Encoder API.


Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Priya Ch
Can some one throw light on this ?

Regards,
Padma Ch

On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch 
wrote:

> Chris, we are using spark 1.3.0 version. we have not set  
> spark.streaming.concurrentJobs
> this parameter. It takes the default value.
>
> Vijay,
>
>   From the stack trace it is evident that 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
> is throwing the exception. I opened the spark source code and visited the
> line which is throwing this exception i.e
>
> [image: Inline image 1]
>
> The line which is marked in red is throwing the exception. The file is
> ExternalSorter.scala in org.apache.spark.util.collection package.
>
> i went through the following blog
> http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
> and understood that there is a merge factor which decides the number of
> on-disk files that can be merged. Is it somehow related to this?
>
> Regards,
> Padma CH
>
> On Fri, Dec 25, 2015 at 7:51 PM, Chris Fregly  wrote:
>
>> and which version of Spark/Spark Streaming are you using?
>>
>> are you explicitly setting the spark.streaming.concurrentJobs to
>> something larger than the default of 1?
>>
>> if so, please try setting that back to 1 and see if the problem still
>> exists.
>>
>> this is a dangerous parameter to modify from the default - which is why
>> it's not well-documented.
>>
>>
>> On Wed, Dec 23, 2015 at 8:23 AM, Vijay Gharge 
>> wrote:
>>
>>> Few indicators -
>>>
>>> 1) during execution time - check total number of open files using lsof
>>> command. Need root permissions. If it is cluster not sure much !
>>> 2) which exact line in the code is triggering this error ? Can you paste
>>> that snippet ?
>>>
>>>
>>> On Wednesday 23 December 2015, Priya Ch 
>>> wrote:
>>>
 ulimit -n 65000

 fs.file-max = 65000 ( in etc/sysctl.conf file)

 Thanks,
 Padma Ch

 On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma  wrote:

> Could you share the ulimit for your setup please ?
>
> - Thanks, via mobile,  excuse brevity.
> On Dec 22, 2015 6:39 PM, "Priya Ch" 
> wrote:
>
>> Jakob,
>>
>>Increased the settings like fs.file-max in /etc/sysctl.conf and
>> also increased user limit in /etc/security/limits.conf. But still
>> see the same issue.
>>
>> On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky 
>> wrote:
>>
>>> It might be a good idea to see how many files are open and try
>>> increasing the open file limit (this is done on an os level). In some
>>> application use-cases it is actually a legitimate need.
>>>
>>> If that doesn't help, make sure you close any unused files and
>>> streams in your code. It will also be easier to help diagnose the issue 
>>> if
>>> you send an error-reproducing snippet.
>>>
>>
>>

>>>
>>> --
>>> Regards,
>>> Vijay Gharge
>>>
>>>
>>>
>>>
>>
>>
>> --
>>
>> *Chris Fregly*
>> Principal Data Solutions Engineer
>> IBM Spark Technology Center, San Francisco, CA
>> http://spark.tc | http://advancedspark.com
>>
>
>


Re: sparkR ORC support.

2016-01-05 Thread Deepak Sharma
Hi Sandeep
can you try this ?

results <- sql(hivecontext, "FROM test SELECT id","")

Thanks
Deepak


On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana 
wrote:

> Thanks Deepak.
>
> I tried this as well. I created a hivecontext   with  "hivecontext <<-
> sparkRHive.init(sc) "  .
>
> When I tried to read hive table from this ,
>
> results <- sql(hivecontext, "FROM test SELECT id")
>
> I get below error,
>
> Error in callJMethod(sqlContext, "sql", sqlQuery) :
>   Invalid jobj 2. If SparkR was restarted, Spark operations need to be 
> re-executed.
>
>
> Not sure what is causing this? Any leads or ideas? I am using rstudio.
>
>
>
> On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma 
> wrote:
>
>> Hi Sandeep
>> I am not sure if ORC can be read directly in R.
>> But there can be a workaround .First create hive table on top of ORC
>> files and then access hive table in R.
>>
>> Thanks
>> Deepak
>>
>> On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana 
>> wrote:
>>
>>> Hello
>>>
>>> I need to read an ORC files in hdfs in R using spark. I am not able to
>>> find a package to do that.
>>>
>>> Can anyone help with documentation or example for this purpose?
>>>
>>> --
>>> Architect
>>> Infoworks.io
>>> http://Infoworks.io
>>>
>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>
>
>
> --
> Architect
> Infoworks.io
> http://Infoworks.io
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


problem building spark on centos

2016-01-05 Thread Jade Liu
Hi, All:

I'm trying to build spark 1.5.2 from source using maven with the following 
command:
./make-distribution.sh --tgz -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0 
-Dscala-2.11 -Phive -Phive-thriftserver -DskipTests

I got the following error:
+ VERSION='[ERROR] [Help 2] 
http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException'

When I try:
build/mvn -X -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean 
package

I got the following error:
[FATAL] Non-resolvable parent POM for org.apache.spark:spark-parent_2.11:1.5.2: 
Could not transfer artifact org.apache:apache:pom:14 from/to central 
(https://repo1.maven.org/maven2): java.security.ProviderException: 
java.security.KeyException and 'parent.relativePath' points at wrong local POM 
@ line 22, column 11

Does anyone know how to change the settings in maven to fix this?

Thanks in advance,

Jade


pyspark dataframe: row with a minimum value of a column for each group

2016-01-05 Thread Wei Chen
Hi,

I am trying to retrieve the rows with a minimum value of a column for each
group. For example: the following dataframe:

a | b | c
--
1 | 1 | 1
1 | 2 | 2
1 | 3 | 3
2 | 1 | 4
2 | 2 | 5
2 | 3 | 6
3 | 1 | 7
3 | 2 | 8
3 | 3 | 9
--

I group by 'a', and want the rows with the smallest 'b', that is, I want to
return the following dataframe:

a | b | c
--
1 | 1 | 1
2 | 1 | 4
3 | 1 | 7
--

The dataframe I have is huge, so getting the minimum value of b for each group
and then joining back on the original dataframe is very expensive. Is there a
better way to do this?


Thanks,
Wei
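
One common alternative to the self-join is a window function: rank the rows
within each group of 'a' by 'b' and keep only the first one. Sketched here in
Scala since most of the code in this digest is Scala; the pyspark API mirrors
it (pyspark.sql.window.Window and pyspark.sql.functions.row_number). Note that
on Spark 1.5/1.6 window functions require a HiveContext, and row_number() was
named rowNumber() before 1.6.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// One window partition per value of "a", ordered by "b" ascending.
val w = Window.partitionBy("a").orderBy("b")

// Keep only the top-ranked row per group, i.e. the row with the minimum b.
val minPerGroup = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")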


Re: problem building spark on centos

2016-01-05 Thread Ted Yu
Which version of maven are you using ?

It should be 3.3.3+

On Tue, Jan 5, 2016 at 4:54 PM, Jade Liu  wrote:

> Hi, All:
>
> I’m trying to build spark 1.5.2 from source using maven with the following
> command:
>
> ./make-distribution.sh --tgz -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0
> -Dscala-2.11 -Phive -Phive-thriftserver –DskipTests
>
>
> I got the following error:
>
> + VERSION='[ERROR] [Help 2]
> http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException
> '
>
>
> When I try:
>
> build/mvn -X -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean
> package
>
>
> I got the following error:
>
> [FATAL] Non-resolvable parent POM for
> org.apache.spark:spark-parent_2.11:1.5.2: Could not transfer artifact
> org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
> java.security.ProviderException: java.security.KeyException and
> 'parent.relativePath' points at wrong local POM @ line 22, column 11
>
>
> Does anyone know how to change the settings in maven to fix this?
>
>
> Thanks in advance,
>
>
> Jade
>


DataFrame withColumnRenamed throwing NullPointerException

2016-01-05 Thread Prasad Ravilla
I am joining two data frames as shown in the code below. This is throwing 
NullPointerException.

I have a number of different joins throughout the program, and the SparkContext 
throws this NullPointerException randomly on one of the joins.
The two data frames are very large (around 1 TB).

I am using Spark version 1.5.2.

Thanks in advance for any insights.

Regards,
Prasad.


Below is the code.

val userAndFmSegment = userData.as("userdata")
  .join(fmSegmentData.withColumnRenamed("USER_ID", "FM_USER_ID").as("fmsegmentdata"),
    $"userdata.PRIMARY_USER_ID" === $"fmsegmentdata.FM_USER_ID"
      && $"fmsegmentdata.END_DATE" >= date_sub($"userdata.REPORT_DATE", trailingWeeks * 7)
      && $"fmsegmentdata.START_DATE" <= date_sub($"userdata.REPORT_DATE", trailingWeeks * 7),
    "inner")
  .select(
    "USER_ID",
    "PRIMARY_USER_ID",
    "FM_BUYER_TYPE_CD"
  )





Log


16/01/05 17:41:19 ERROR ApplicationMaster: User class threw exception: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.spark.sql.DataFrame.withColumnRenamed(DataFrame.scala:1161)
at DnaAgg$.getUserIdAndFMSegmentId$1(DnaAgg.scala:294)
at DnaAgg$.main(DnaAgg.scala:339)
at DnaAgg.main(DnaAgg.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)





UpdateStateByKey : Partitioning and Shuffle

2016-01-05 Thread Soumitra Johri
Hi,

I am relatively new to Spark and am using updateStateByKey() operation to
maintain state in my Spark Streaming application. The input data is coming
through a Kafka topic.

   1. I want to understand how are DStreams partitioned?
   2. How does the partitioning work with mapWithState() or
   updateStatebyKey() method?
   3. In updateStateByKey() does the old state and the new values against a
   given key processed on same node ?
   4. How frequent is the shuffle for updateStateByKey() method ?

The state I have to maintain contains ~10 keys, and I want to avoid a shuffle
every time I update the state. Any tips on how to do that?

Warm Regards
Soumitra


Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Jeff Zhang
Right, I can override the root pool in configuration file, Thanks Mark.

On Wed, Jan 6, 2016 at 8:45 AM, Mark Hamstra 
wrote:

> Just configure <pool name="default"> with <schedulingMode>FAIR</schedulingMode>
> in fairscheduler.xml (or
> in spark.scheduler.allocation.file if you have overridden the default name
> for the config file.)  `buildDefaultPool()` will only build the pool named
> "default" with the default properties (such as schedulingMode =
> DEFAULT_SCHEDULING_MODE -- i.e. SchedulingMode.FIFO) if that pool name is
> not already built (
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala#L90
> ).
>
>
> On Tue, Jan 5, 2016 at 4:15 PM, Jeff Zhang  wrote:
>
>> Sorry, I don't make it clearly. What I want is the default pool is fair
>> scheduling. But seems if I want to use fair scheduling now, I have to set
>> spark.scheduler.pool explicitly.
>>
>> On Wed, Jan 6, 2016 at 2:03 AM, Mark Hamstra 
>> wrote:
>>
>>> I don't understand.  If you're using fair scheduling and don't set a
>>> pool, the default pool will be used.
>>>
>>> On Tue, Jan 5, 2016 at 1:57 AM, Jeff Zhang  wrote:
>>>

 It seems currently spark.scheduler.pool must be set as localProperties
 (associate with thread). Any reason why spark.scheduler.pool can not be
 used globally.  My scenario is that I want my thriftserver started with
 fair scheduler as the default pool without using set command to set the
 pool. Is there anyway to do that ? Or do I miss anything here ?

 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


-- 
Best Regards

Jeff Zhang


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Jeff Zhang
+1

On Wed, Jan 6, 2016 at 9:18 AM, Juliet Hougland 
wrote:

> Most admins I talk to about python and spark are already actively (or on
> their way to) managing their cluster python installations. Even if people
> begin using the system python with pyspark, there is eventually a user who
> needs a complex dependency (like pandas or sklearn) on the cluster. No
> admin would muck around installing libs into system python, so you end up
> with other python installations.
>
> Installing a non-system python is something users intending to use pyspark
> on a real cluster should be thinking about, eventually, anyway. It would
> work in situations where people are running pyspark locally or actively
> managing python installations on a cluster. There is an awkward middle
> point where someone has installed spark but not configured their cluster
> (by installing non default python) in any other way. Most clusters I see
> are RHEL/CentOS and have something other than system python used by spark.
>
> What libraries stopped supporting python 2.6 and where does spark use
> them? The "ease of transitioning to pyspark onto a cluster" problem may be
> an easier pill to swallow if it only affected something like mllib or spark
> sql and not parts of the core api. You end up hoping numpy or pandas are
> installed in the runtime components of spark anyway. At that point people
> really should just go install a non system python. There are tradeoffs to
> using pyspark and I feel pretty fine explaining to people that managing
> their cluster's python installations is something that comes with using
> pyspark.
>
> RHEL/CentOS is so common that this would probably be a little work for a
> lot of people.
>
> --Juliet
>
> On Tue, Jan 5, 2016 at 4:07 PM, Koert Kuipers  wrote:
>
>> hey evil admin:)
>> i think the bit about java was from me?
>> if so, i meant to indicate that the reality for us is java is 1.7 on most
>> (all?) clusters. i do not believe spark prefers java 1.8. my point was that
>> even although java 1.7 is getting old as well it would be a major issue for
>> me if spark dropped java 1.7 support.
>>
>> On Tue, Jan 5, 2016 at 6:53 PM, Carlile, Ken 
>> wrote:
>>
>>> As one of the evil administrators that runs a RHEL 6 cluster, we already
>>> provide quite a few different version of python on our cluster pretty darn
>>> easily. All you need is a separate install directory and to set the
>>> PYTHON_HOME environment variable to point to the correct python, then have
>>> the users make sure the correct python is in their PATH. I understand that
>>> other administrators may not be so compliant.
>>>
>>> Saw a small bit about the java version in there; does Spark currently
>>> prefer Java 1.8.x?
>>>
>>> —Ken
>>>
>>> On Jan 5, 2016, at 6:08 PM, Josh Rosen  wrote:
>>>
>>> Note that you _can_ use a Python 2.7 `ipython` executable on the driver
 while continuing to use a vanilla `python` executable on the executors
>>>
>>>
>>> Whoops, just to be clear, this should actually read "while continuing to
>>> use a vanilla `python` 2.7 executable".
>>>
>>> On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen 
>>> wrote:
>>>
 Yep, the driver and executors need to have compatible Python versions.
 I think that there are some bytecode-level incompatibilities between 2.6
 and 2.7 which would impact the deserialization of Python closures, so I
 think you need to be running the same 2.x version for all communicating
 Spark processes. Note that you _can_ use a Python 2.7 `ipython` executable
 on the driver while continuing to use a vanilla `python` executable on the
 executors (we have environment variables which allow you to control these
 separately).

 On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> I think all the slaves need the same (or a compatible) version of
> Python installed since they run Python code in PySpark jobs natively.
>
> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers 
> wrote:
>
>> interesting i didnt know that!
>>
>> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> even if python 2.7 was needed only on this one machine that launches
>>> the app we can not ship it with our software because its gpl licensed
>>>
>>> Not to nitpick, but maybe this is important. The Python license is 
>>> GPL-compatible
>>> but not GPL :
>>>
>>> Note GPL-compatible doesn’t mean that we’re distributing Python
>>> under the GPL. All Python licenses, unlike the GPL, let you distribute a
>>> modified version without making your changes open source. The
>>> GPL-compatible licenses make it possible to combine Python with other
>>> software that is 

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
hey evil admin:)
i think the bit about java was from me?
if so, i meant to indicate that the reality for us is java is 1.7 on most
(all?) clusters. i do not believe spark prefers java 1.8. my point was that
even although java 1.7 is getting old as well it would be a major issue for
me if spark dropped java 1.7 support.

On Tue, Jan 5, 2016 at 6:53 PM, Carlile, Ken 
wrote:

> As one of the evil administrators that runs a RHEL 6 cluster, we already
> provide quite a few different version of python on our cluster pretty darn
> easily. All you need is a separate install directory and to set the
> PYTHON_HOME environment variable to point to the correct python, then have
> the users make sure the correct python is in their PATH. I understand that
> other administrators may not be so compliant.
>
> Saw a small bit about the java version in there; does Spark currently
> prefer Java 1.8.x?
>
> —Ken
>
> On Jan 5, 2016, at 6:08 PM, Josh Rosen  wrote:
>
> Note that you _can_ use a Python 2.7 `ipython` executable on the driver
>> while continuing to use a vanilla `python` executable on the executors
>
>
> Whoops, just to be clear, this should actually read "while continuing to
> use a vanilla `python` 2.7 executable".
>
> On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen 
> wrote:
>
>> Yep, the driver and executors need to have compatible Python versions. I
>> think that there are some bytecode-level incompatibilities between 2.6 and
>> 2.7 which would impact the deserialization of Python closures, so I think
>> you need to be running the same 2.x version for all communicating Spark
>> processes. Note that you _can_ use a Python 2.7 `ipython` executable on the
>> driver while continuing to use a vanilla `python` executable on the
>> executors (we have environment variables which allow you to control these
>> separately).
>>
>> On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I think all the slaves need the same (or a compatible) version of Python
>>> installed since they run Python code in PySpark jobs natively.
>>>
>>> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers  wrote:
>>>
 interesting i didnt know that!

 On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> even if python 2.7 was needed only on this one machine that launches
> the app we can not ship it with our software because its gpl licensed
>
> Not to nitpick, but maybe this is important. The Python license is 
> GPL-compatible
> but not GPL :
>
> Note GPL-compatible doesn’t mean that we’re distributing Python under
> the GPL. All Python licenses, unlike the GPL, let you distribute a 
> modified
> version without making your changes open source. The GPL-compatible
> licenses make it possible to combine Python with other software that is
> released under the GPL; the others don’t.
>
> Nick
> ​
>
> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers 
> wrote:
>
>> i do not think so.
>>
>> does the python 2.7 need to be installed on all slaves? if so, we do
>> not have direct access to those.
>>
>> also, spark is easy for us to ship with our software since its apache
>> 2 licensed, and it only needs to be present on the machine that launches
>> the app (thanks to yarn).
>> even if python 2.7 was needed only on this one machine that launches
>> the app we can not ship it with our software because its gpl licensed, so
>> the client would have to download it and install it themselves, and this
>> would mean its an independent install which has to be audited and 
>> approved
>> and now you are in for a lot of fun. basically it will never happen.
>>
>>
>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
>> wrote:
>>
>>> If users are able to install Spark 2.0 on their RHEL clusters, then
>>> I imagine that they're also capable of installing a standalone Python
>>> alongside that Spark version (without changing Python systemwide). For
>>> instance, Anaconda/Miniconda make it really easy to install Python
>>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>>> require any special permissions to install (you don't need root / sudo
>>> access). Does this address the Python versioning concerns for RHEL 
>>> users?
>>>
>>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers 
>>> wrote:
>>>
 yeah, the practical concern is that we have no control over java or
 python version on large company clusters. our current reality for the 
 vast
 majority of them is java 7 and python 2.6, no matter how outdated that 
 is.

 i dont 

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Jeff Zhang
Sorry, I don't make it clearly. What I want is the default pool is fair
scheduling. But seems if I want to use fair scheduling now, I have to set
spark.scheduler.pool explicitly.

On Wed, Jan 6, 2016 at 2:03 AM, Mark Hamstra 
wrote:

> I don't understand.  If you're using fair scheduling and don't set a pool,
> the default pool will be used.
>
> On Tue, Jan 5, 2016 at 1:57 AM, Jeff Zhang  wrote:
>
>>
>> It seems currently spark.scheduler.pool must be set as localProperties
>> (associate with thread). Any reason why spark.scheduler.pool can not be
>> used globally.  My scenario is that I want my thriftserver started with
>> fair scheduler as the default pool without using set command to set the
>> pool. Is there anyway to do that ? Or do I miss anything here ?
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


-- 
Best Regards

Jeff Zhang


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
I don't think that we're planning to drop Java 7 support for Spark 2.0.

Personally, I would recommend using Java 8 if you're running Spark 1.5.0+
and are using SQL/DataFrames so that you can benefit from improvements to
code cache flushing in the Java 8 JVMs. Spark SQL's generated classes can
fill up the JVM's code cache, which causes JIT to stop working for new
bytecode. Empirically, it looks like the Java 8 JVMs have an improved
ability to flush this code cache, thereby avoiding this problem.

TL;DR: I'd prefer to run Java 8 with Spark if given the choice.

On Tue, Jan 5, 2016 at 4:07 PM, Koert Kuipers  wrote:

> hey evil admin:)
> i think the bit about java was from me?
> if so, i meant to indicate that the reality for us is java is 1.7 on most
> (all?) clusters. i do not believe spark prefers java 1.8. my point was that
> even although java 1.7 is getting old as well it would be a major issue for
> me if spark dropped java 1.7 support.
>
> On Tue, Jan 5, 2016 at 6:53 PM, Carlile, Ken 
> wrote:
>
>> As one of the evil administrators that runs a RHEL 6 cluster, we already
>> provide quite a few different version of python on our cluster pretty darn
>> easily. All you need is a separate install directory and to set the
>> PYTHON_HOME environment variable to point to the correct python, then have
>> the users make sure the correct python is in their PATH. I understand that
>> other administrators may not be so compliant.
>>
>> Saw a small bit about the java version in there; does Spark currently
>> prefer Java 1.8.x?
>>
>> —Ken
>>
>> On Jan 5, 2016, at 6:08 PM, Josh Rosen  wrote:
>>
>> Note that you _can_ use a Python 2.7 `ipython` executable on the driver
>>> while continuing to use a vanilla `python` executable on the executors
>>
>>
>> Whoops, just to be clear, this should actually read "while continuing to
>> use a vanilla `python` 2.7 executable".
>>
>> On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen 
>> wrote:
>>
>>> Yep, the driver and executors need to have compatible Python versions. I
>>> think that there are some bytecode-level incompatibilities between 2.6 and
>>> 2.7 which would impact the deserialization of Python closures, so I think
>>> you need to be running the same 2.x version for all communicating Spark
>>> processes. Note that you _can_ use a Python 2.7 `ipython` executable on the
>>> driver while continuing to use a vanilla `python` executable on the
>>> executors (we have environment variables which allow you to control these
>>> separately).
>>>
>>> On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 I think all the slaves need the same (or a compatible) version of
 Python installed since they run Python code in PySpark jobs natively.

 On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers  wrote:

> interesting i didnt know that!
>
> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> even if python 2.7 was needed only on this one machine that launches
>> the app we can not ship it with our software because its gpl licensed
>>
>> Not to nitpick, but maybe this is important. The Python license is 
>> GPL-compatible
>> but not GPL :
>>
>> Note GPL-compatible doesn’t mean that we’re distributing Python under
>> the GPL. All Python licenses, unlike the GPL, let you distribute a 
>> modified
>> version without making your changes open source. The GPL-compatible
>> licenses make it possible to combine Python with other software that is
>> released under the GPL; the others don’t.
>>
>> Nick
>> ​
>>
>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers 
>> wrote:
>>
>>> i do not think so.
>>>
>>> does the python 2.7 need to be installed on all slaves? if so, we do
>>> not have direct access to those.
>>>
>>> also, spark is easy for us to ship with our software since its
>>> apache 2 licensed, and it only needs to be present on the machine that
>>> launches the app (thanks to yarn).
>>> even if python 2.7 was needed only on this one machine that launches
>>> the app we can not ship it with our software because its gpl licensed, 
>>> so
>>> the client would have to download it and install it themselves, and this
>>> would mean its an independent install which has to be audited and 
>>> approved
>>> and now you are in for a lot of fun. basically it will never happen.
>>>
>>>
>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen >> > wrote:
>>>
 If users are able to install Spark 2.0 on their RHEL clusters, then
 I imagine that they're also capable of installing a standalone Python
 alongside that Spark version 

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Mark Hamstra
Just configure a pool named "default" with schedulingMode FAIR in
fairscheduler.xml (or in the file pointed to by
spark.scheduler.allocation.file if you have overridden the default name for
the config file.)  `buildDefaultPool()` will only build the pool named
"default" with the default properties (such as schedulingMode =
DEFAULT_SCHEDULING_MODE -- i.e. SchedulingMode.FIFO) if that pool name is
not already built (
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala#L90
).
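
For reference, a minimal fairscheduler.xml along those lines might look like
the following (the weight/minShare values are illustrative; what matters here
is defining a pool named "default" with FAIR mode), together with setting
spark.scheduler.mode=FAIR so the fair scheduler is used at all:

<?xml version="1.0"?>
<allocations>
  <pool name="default">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>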


On Tue, Jan 5, 2016 at 4:15 PM, Jeff Zhang  wrote:

> Sorry, I didn't make it clear. What I want is for the default pool to use
> fair scheduling. But it seems that if I want to use fair scheduling now, I
> have to set spark.scheduler.pool explicitly.
>
> On Wed, Jan 6, 2016 at 2:03 AM, Mark Hamstra 
> wrote:
>
>> I don't understand.  If you're using fair scheduling and don't set a
>> pool, the default pool will be used.
>>
>> On Tue, Jan 5, 2016 at 1:57 AM, Jeff Zhang  wrote:
>>
>>>
>>> It seems currently spark.scheduler.pool must be set as localProperties
>>> (associate with thread). Any reason why spark.scheduler.pool can not be
>>> used globally.  My scenario is that I want my thriftserver started with
>>> fair scheduler as the default pool without using set command to set the
>>> pool. Is there anyway to do that ? Or do I miss anything here ?
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Spark SQL dataframes explode /lateral view help

2016-01-05 Thread Deenar Toraskar
Hi All

I have the following Spark SQL query and would like to convert it to use the
DataFrames API (Spark 1.6). The eee, eep and pfep columns are all maps of
(int -> float).


s"""select e.counterparty, epe, mpfe, eepe, noOfMonthseep, teee as
  |effectiveExpectedExposure, teep as expectedExposure, tpfep as pfe
  |from exposureMeasuresCpty e
  |  lateral view explode(eee) dummy1 as noOfMonthseee, teee
  |  lateral view explode(eep) dummy2 as noOfMonthseep, teep
  |  lateral view explode(pfep) dummy3 as noOfMonthspfep, tpfep
  |where e.counterparty = '$cpty' and noOfMonthseee = noOfMonthseep and
  |  noOfMonthseee = noOfMonthspfep
  |order by noOfMonthseep""".stripMargin

Any guidance or samples would be appreciated. I have seen code snippets
that handle arrays, but haven't come across how to handle maps.

case class Book(title: String, words: String)
   val df: RDD[Book]

   case class Word(word: String)
   val allWords = df.explode('words) {
 case Row(words: String) => words.split(" ").map(Word(_))
   }

   val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))
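
For the map columns, a map-flavoured variant of that same pattern might look
like this (a rough sketch: it assumes the exposureMeasuresCpty data is in a
DataFrame called exposures, that the map columns come back as Scala Maps, and
the case class name and key/value types are taken from the description above):

import org.apache.spark.sql.Row
import sqlContext.implicits._   // for the $"..." column syntax

case class Entry(noOfMonths: Int, value: Float)

// one output row per (noOfMonths, value) entry of the eee map column
val eeeExploded = exposures.explode($"eee") {
  case Row(m: scala.collection.Map[Int, Float] @unchecked) =>
    m.map { case (k, v) => Entry(k, v) }
}

Repeating this for eep and pfep and then joining/filtering on counterparty
and the noOfMonths columns should stand in for the three lateral views.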


Regards
Deenar


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
Created JIRA: https://issues.apache.org/jira/browse/SPARK-12661

On Tue, Jan 5, 2016 at 2:49 PM, Koert Kuipers  wrote:
> i do not think so.
>
> does the python 2.7 need to be installed on all slaves? if so, we do not
> have direct access to those.
>
> also, spark is easy for us to ship with our software since its apache 2
> licensed, and it only needs to be present on the machine that launches the
> app (thanks to yarn).
> even if python 2.7 was needed only on this one machine that launches the app
> we can not ship it with our software because its gpl licensed, so the client
> would have to download it and install it themselves, and this would mean its
> an independent install which has to be audited and approved and now you are
> in for a lot of fun. basically it will never happen.
>
>
> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen  wrote:
>>
>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>> imagine that they're also capable of installing a standalone Python
>> alongside that Spark version (without changing Python systemwide). For
>> instance, Anaconda/Miniconda make it really easy to install Python 2.7.x/3.x
>> without impacting / changing the system Python and doesn't require any
>> special permissions to install (you don't need root / sudo access). Does
>> this address the Python versioning concerns for RHEL users?
>>
>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers  wrote:
>>>
>>> yeah, the practical concern is that we have no control over java or
>>> python version on large company clusters. our current reality for the vast
>>> majority of them is java 7 and python 2.6, no matter how outdated that is.
>>>
>>> i dont like it either, but i cannot change it.
>>>
>>> we currently don't use pyspark so i have no stake in this, but if we did
>>> i can assure you we would not upgrade to spark 2.x if python 2.6 was
>>> dropped. no point in developing something that doesnt run for majority of
>>> customers.
>>>
>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas
>>>  wrote:

 As I pointed out in my earlier email, RHEL will support Python 2.6 until
 2020. So I'm assuming these large companies will have the option of riding
 out Python 2.6 until then.

 Are we seriously saying that Spark should likewise support Python 2.6
 for the next several years? Even though the core Python devs stopped
 supporting it in 2013?

 If that's not what we're suggesting, then when, roughly, can we drop
 support? What are the criteria?

 I understand the practical concern here. If companies are stuck using
 2.6, it doesn't matter to them that it is deprecated. But balancing that
 concern against the maintenance burden on this project, I would say that
 "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
 take. There are many tiny annoyances one has to put up with to support 2.6.

 I suppose if our main PySpark contributors are fine putting up with
 those annoyances, then maybe we don't need to drop support just yet...

 Nick
 On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:
>
> Unfortunately, Koert is right.
>
> I've been in a couple of projects using Spark (banking industry) where
> CentOS + Python 2.6 is the toolbox available.
>
> That said, I believe it should not be a concern for Spark. Python 2.6
> is old and busted, which is totally opposite to the Spark philosophy IMO.
>
>
> El 5 ene 2016, a las 20:07, Koert Kuipers  escribió:
>
> rhel/centos 6 ships with python 2.6, doesnt it?
>
> if so, i still know plenty of large companies where python 2.6 is the
> only option. asking them for python 2.7 is not going to work
>
> so i think its a bad idea
>
> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland
>  wrote:
>>
>> I don't see a reason Spark 2.0 would need to support Python 2.6. At
>> this point, Python 3 should be the default that is encouraged.
>> Most organizations acknowledge the 2.7 is common, but lagging behind
>> the version they should theoretically use. Dropping python 2.6
>> support sounds very reasonable to me.
>>
>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas
>>  wrote:
>>>
>>> +1
>>>
>>> Red Hat supports Python 2.6 on REHL 5 until 2020, but otherwise yes,
>>> Python 2.6 is ancient history and the core Python developers stopped
>>> supporting it in 2013. REHL 5 is not a good enough reason to continue
>>> support for Python 2.6 IMO.
>>>
>>> We should aim to support Python 2.7 and Python 3.3+ (which I believe
>>> we currently do).
>>>
>>> Nick
>>>
>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang 

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
interesting i didnt know that!

On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas  wrote:

> even if python 2.7 was needed only on this one machine that launches the
> app we can not ship it with our software because its gpl licensed
>
> Not to nitpick, but maybe this is important. The Python license is 
> GPL-compatible
> but not GPL :
>
> Note GPL-compatible doesn’t mean that we’re distributing Python under the
> GPL. All Python licenses, unlike the GPL, let you distribute a modified
> version without making your changes open source. The GPL-compatible
> licenses make it possible to combine Python with other software that is
> released under the GPL; the others don’t.
>
> Nick
> ​
>
> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers  wrote:
>
>> i do not think so.
>>
>> does the python 2.7 need to be installed on all slaves? if so, we do not
>> have direct access to those.
>>
>> also, spark is easy for us to ship with our software since its apache 2
>> licensed, and it only needs to be present on the machine that launches the
>> app (thanks to yarn).
>> even if python 2.7 was needed only on this one machine that launches the
>> app we can not ship it with our software because its gpl licensed, so the
>> client would have to download it and install it themselves, and this would
>> mean its an independent install which has to be audited and approved and
>> now you are in for a lot of fun. basically it will never happen.
>>
>>
>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
>> wrote:
>>
>>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>>> imagine that they're also capable of installing a standalone Python
>>> alongside that Spark version (without changing Python systemwide). For
>>> instance, Anaconda/Miniconda make it really easy to install Python
>>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>>> require any special permissions to install (you don't need root / sudo
>>> access). Does this address the Python versioning concerns for RHEL users?
>>>
>>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers  wrote:
>>>
 yeah, the practical concern is that we have no control over java or
 python version on large company clusters. our current reality for the vast
 majority of them is java 7 and python 2.6, no matter how outdated that is.

 i dont like it either, but i cannot change it.

 we currently don't use pyspark so i have no stake in this, but if we
 did i can assure you we would not upgrade to spark 2.x if python 2.6 was
 dropped. no point in developing something that doesnt run for majority of
 customers.

 On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> As I pointed out in my earlier email, RHEL will support Python 2.6
> until 2020. So I'm assuming these large companies will have the option of
> riding out Python 2.6 until then.
>
> Are we seriously saying that Spark should likewise support Python 2.6
> for the next several years? Even though the core Python devs stopped
> supporting it in 2013?
>
> If that's not what we're suggesting, then when, roughly, can we drop
> support? What are the criteria?
>
> I understand the practical concern here. If companies are stuck using
> 2.6, it doesn't matter to them that it is deprecated. But balancing that
> concern against the maintenance burden on this project, I would say that
> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
> take. There are many tiny annoyances one has to put up with to support 
> 2.6.
>
> I suppose if our main PySpark contributors are fine putting up with
> those annoyances, then maybe we don't need to drop support just yet...
>
> Nick
> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:
>
>> Unfortunately, Koert is right.
>>
>> I've been in a couple of projects using Spark (banking industry)
>> where CentOS + Python 2.6 is the toolbox available.
>>
>> That said, I believe it should not be a concern for Spark. Python 2.6
>> is old and busted, which is totally opposite to the Spark philosophy IMO.
>>
>>
>> El 5 ene 2016, a las 20:07, Koert Kuipers 
>> escribió:
>>
>> rhel/centos 6 ships with python 2.6, doesnt it?
>>
>> if so, i still know plenty of large companies where python 2.6 is the
>> only option. asking them for python 2.7 is not going to work
>>
>> so i think its a bad idea
>>
>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
>> juliet.hougl...@gmail.com> wrote:
>>
>>> I don't see a reason Spark 2.0 would need to support Python 2.6. At
>>> this point, Python 3 should be the default that is 

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
I think all the slaves need the same (or a compatible) version of Python
installed since they run Python code in PySpark jobs natively.

On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers  wrote:

> interesting i didnt know that!
>
> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> even if python 2.7 was needed only on this one machine that launches the
>> app we can not ship it with our software because its gpl licensed
>>
>> Not to nitpick, but maybe this is important. The Python license is 
>> GPL-compatible
>> but not GPL :
>>
>> Note GPL-compatible doesn’t mean that we’re distributing Python under the
>> GPL. All Python licenses, unlike the GPL, let you distribute a modified
>> version without making your changes open source. The GPL-compatible
>> licenses make it possible to combine Python with other software that is
>> released under the GPL; the others don’t.
>>
>> Nick
>> ​
>>
>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers  wrote:
>>
>>> i do not think so.
>>>
>>> does the python 2.7 need to be installed on all slaves? if so, we do not
>>> have direct access to those.
>>>
>>> also, spark is easy for us to ship with our software since its apache 2
>>> licensed, and it only needs to be present on the machine that launches the
>>> app (thanks to yarn).
>>> even if python 2.7 was needed only on this one machine that launches the
>>> app we can not ship it with our software because its gpl licensed, so the
>>> client would have to download it and install it themselves, and this would
>>> mean its an independent install which has to be audited and approved and
>>> now you are in for a lot of fun. basically it will never happen.
>>>
>>>
>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
>>> wrote:
>>>
 If users are able to install Spark 2.0 on their RHEL clusters, then I
 imagine that they're also capable of installing a standalone Python
 alongside that Spark version (without changing Python systemwide). For
 instance, Anaconda/Miniconda make it really easy to install Python
 2.7.x/3.x without impacting / changing the system Python and doesn't
 require any special permissions to install (you don't need root / sudo
 access). Does this address the Python versioning concerns for RHEL users?

 On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers 
 wrote:

> yeah, the practical concern is that we have no control over java or
> python version on large company clusters. our current reality for the vast
> majority of them is java 7 and python 2.6, no matter how outdated that is.
>
> i dont like it either, but i cannot change it.
>
> we currently don't use pyspark so i have no stake in this, but if we
> did i can assure you we would not upgrade to spark 2.x if python 2.6 was
> dropped. no point in developing something that doesnt run for majority of
> customers.
>
> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> As I pointed out in my earlier email, RHEL will support Python 2.6
>> until 2020. So I'm assuming these large companies will have the option of
>> riding out Python 2.6 until then.
>>
>> Are we seriously saying that Spark should likewise support Python 2.6
>> for the next several years? Even though the core Python devs stopped
>> supporting it in 2013?
>>
>> If that's not what we're suggesting, then when, roughly, can we drop
>> support? What are the criteria?
>>
>> I understand the practical concern here. If companies are stuck using
>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>> concern against the maintenance burden on this project, I would say that
>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position 
>> to
>> take. There are many tiny annoyances one has to put up with to support 
>> 2.6.
>>
>> I suppose if our main PySpark contributors are fine putting up with
>> those annoyances, then maybe we don't need to drop support just yet...
>>
>> Nick
>> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente <
>> ju...@esbet.es> wrote:
>>
>>> Unfortunately, Koert is right.
>>>
>>> I've been in a couple of projects using Spark (banking industry)
>>> where CentOS + Python 2.6 is the toolbox available.
>>>
>>> That said, I believe it should not be a concern for Spark. Python
>>> 2.6 is old and busted, which is totally opposite to the Spark philosophy
>>> IMO.
>>>
>>>
>>> El 5 ene 2016, a las 20:07, Koert Kuipers 
>>> escribió:
>>>
>>> rhel/centos 6 ships with python 2.6, doesnt it?
>>>
>>> if so, i still know plenty of large companies where python 2.6 is
>>> the only option. 

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
>
> Note that you _can_ use a Python 2.7 `ipython` executable on the driver
> while continuing to use a vanilla `python` executable on the executors


Whoops, just to be clear, this should actually read "while continuing to
use a vanilla `python` 2.7 executable".

On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen  wrote:

> Yep, the driver and executors need to have compatible Python versions. I
> think that there are some bytecode-level incompatibilities between 2.6 and
> 2.7 which would impact the deserialization of Python closures, so I think
> you need to be running the same 2.x version for all communicating Spark
> processes. Note that you _can_ use a Python 2.7 `ipython` executable on the
> driver while continuing to use a vanilla `python` executable on the
> executors (we have environment variables which allow you to control these
> separately).
>
> On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I think all the slaves need the same (or a compatible) version of Python
>> installed since they run Python code in PySpark jobs natively.
>>
>> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers  wrote:
>>
>>> interesting i didnt know that!
>>>
>>> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 even if python 2.7 was needed only on this one machine that launches
 the app we can not ship it with our software because its gpl licensed

 Not to nitpick, but maybe this is important. The Python license is 
 GPL-compatible
 but not GPL :

 Note GPL-compatible doesn’t mean that we’re distributing Python under
 the GPL. All Python licenses, unlike the GPL, let you distribute a modified
 version without making your changes open source. The GPL-compatible
 licenses make it possible to combine Python with other software that is
 released under the GPL; the others don’t.

 Nick
 ​

 On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers  wrote:

> i do not think so.
>
> does the python 2.7 need to be installed on all slaves? if so, we do
> not have direct access to those.
>
> also, spark is easy for us to ship with our software since its apache
> 2 licensed, and it only needs to be present on the machine that launches
> the app (thanks to yarn).
> even if python 2.7 was needed only on this one machine that launches
> the app we can not ship it with our software because its gpl licensed, so
> the client would have to download it and install it themselves, and this
> would mean its an independent install which has to be audited and approved
> and now you are in for a lot of fun. basically it will never happen.
>
>
> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
> wrote:
>
>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>> imagine that they're also capable of installing a standalone Python
>> alongside that Spark version (without changing Python systemwide). For
>> instance, Anaconda/Miniconda make it really easy to install Python
>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>> require any special permissions to install (you don't need root / sudo
>> access). Does this address the Python versioning concerns for RHEL users?
>>
>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers 
>> wrote:
>>
>>> yeah, the practical concern is that we have no control over java or
>>> python version on large company clusters. our current reality for the 
>>> vast
>>> majority of them is java 7 and python 2.6, no matter how outdated that 
>>> is.
>>>
>>> i dont like it either, but i cannot change it.
>>>
>>> we currently don't use pyspark so i have no stake in this, but if we
>>> did i can assure you we would not upgrade to spark 2.x if python 2.6 was
>>> dropped. no point in developing something that doesnt run for majority 
>>> of
>>> customers.
>>>
>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 As I pointed out in my earlier email, RHEL will support Python 2.6
 until 2020. So I'm assuming these large companies will have the option 
 of
 riding out Python 2.6 until then.

 Are we seriously saying that Spark should likewise support Python
 2.6 for the next several years? Even though the core Python devs 
 stopped
 supporting it in 2013?

 If that's not what we're suggesting, then when, roughly, can we
 drop support? What are the criteria?

 I understand the practical concern here. If companies are stuck
 using 2.6, it doesn't matter to them that it is 

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
even if python 2.7 was needed only on this one machine that launches the
app we can not ship it with our software because its gpl licensed

Not to nitpick, but maybe this is important. The Python license is
GPL-compatible
but not GPL :

Note GPL-compatible doesn’t mean that we’re distributing Python under the
GPL. All Python licenses, unlike the GPL, let you distribute a modified
version without making your changes open source. The GPL-compatible
licenses make it possible to combine Python with other software that is
released under the GPL; the others don’t.

Nick
​

On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers  wrote:

> i do not think so.
>
> does the python 2.7 need to be installed on all slaves? if so, we do not
> have direct access to those.
>
> also, spark is easy for us to ship with our software since its apache 2
> licensed, and it only needs to be present on the machine that launches the
> app (thanks to yarn).
> even if python 2.7 was needed only on this one machine that launches the
> app we can not ship it with our software because its gpl licensed, so the
> client would have to download it and install it themselves, and this would
> mean its an independent install which has to be audited and approved and
> now you are in for a lot of fun. basically it will never happen.
>
>
> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
> wrote:
>
>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>> imagine that they're also capable of installing a standalone Python
>> alongside that Spark version (without changing Python systemwide). For
>> instance, Anaconda/Miniconda make it really easy to install Python
>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>> require any special permissions to install (you don't need root / sudo
>> access). Does this address the Python versioning concerns for RHEL users?
>>
>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers  wrote:
>>
>>> yeah, the practical concern is that we have no control over java or
>>> python version on large company clusters. our current reality for the vast
>>> majority of them is java 7 and python 2.6, no matter how outdated that is.
>>>
>>> i dont like it either, but i cannot change it.
>>>
>>> we currently don't use pyspark so i have no stake in this, but if we did
>>> i can assure you we would not upgrade to spark 2.x if python 2.6 was
>>> dropped. no point in developing something that doesnt run for majority of
>>> customers.
>>>
>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 As I pointed out in my earlier email, RHEL will support Python 2.6
 until 2020. So I'm assuming these large companies will have the option of
 riding out Python 2.6 until then.

 Are we seriously saying that Spark should likewise support Python 2.6
 for the next several years? Even though the core Python devs stopped
 supporting it in 2013?

 If that's not what we're suggesting, then when, roughly, can we drop
 support? What are the criteria?

 I understand the practical concern here. If companies are stuck using
 2.6, it doesn't matter to them that it is deprecated. But balancing that
 concern against the maintenance burden on this project, I would say that
 "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
 take. There are many tiny annoyances one has to put up with to support 2.6.

 I suppose if our main PySpark contributors are fine putting up with
 those annoyances, then maybe we don't need to drop support just yet...

 Nick
 On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:

> Unfortunately, Koert is right.
>
> I've been in a couple of projects using Spark (banking industry) where
> CentOS + Python 2.6 is the toolbox available.
>
> That said, I believe it should not be a concern for Spark. Python 2.6
> is old and busted, which is totally opposite to the Spark philosophy IMO.
>
>
> El 5 ene 2016, a las 20:07, Koert Kuipers 
> escribió:
>
> rhel/centos 6 ships with python 2.6, doesnt it?
>
> if so, i still know plenty of large companies where python 2.6 is the
> only option. asking them for python 2.7 is not going to work
>
> so i think its a bad idea
>
> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
> juliet.hougl...@gmail.com> wrote:
>
>> I don't see a reason Spark 2.0 would need to support Python 2.6. At
>> this point, Python 3 should be the default that is encouraged.
>> Most organizations acknowledge the 2.7 is common, but lagging behind
>> the version they should theoretically use. Dropping python 2.6
>> support sounds very reasonable to me.
>>
>> On Tue, Jan 5, 2016 at 5:45 AM, 

Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Ted Yu
Something like the following:

val zeroValue = collection.mutable.Set[String]()

val aggregated = data.aggregateByKey(zeroValue)((set, v) => set += v,
(setOne, setTwo) => setOne ++= setTwo)
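
A fuller sketch of that route, starting from the DataFrame itself (it assumes
the id column is an Int and name is a String, per your example data):

import org.apache.spark.sql.Row

// turn the DataFrame into (id, name) pairs, then fold each group into a Set
val pairs = df.rdd.map { case Row(id: Int, name: String) => (id, name) }
val aggregated = pairs.aggregateByKey(collection.mutable.Set[String]())(
  (set, v) => set += v,
  (s1, s2) => s1 ++= s2)

aggregated.mapValues(_.toList).collect().foreach(println)
// e.g. (1,List(abc, bdf)) and (2,List(ab, cd)); ordering within a group is not guaranteed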

On Tue, Jan 5, 2016 at 2:46 PM, Gavin Yue  wrote:

> Hey,
>
> For example, a table df with two columns
> id  name
> 1   abc
> 1   bdf
> 2   ab
> 2   cd
>
> I want to group by the id and concat the string into array of string. like
> this
>
> id
> 1 [abc,bdf]
> 2 [ab, cd]
>
> How could I achieve this in dataframe?  I stuck on df.groupBy("id"). ???
>
> Thanks
>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
if python 2.7 only has to be present on the node that launches the app
(does it?) then that could be important indeed.

On Tue, Jan 5, 2016 at 6:02 PM, Koert Kuipers  wrote:

> interesting i didnt know that!
>
> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> even if python 2.7 was needed only on this one machine that launches the
>> app we can not ship it with our software because its gpl licensed
>>
>> Not to nitpick, but maybe this is important. The Python license is 
>> GPL-compatible
>> but not GPL :
>>
>> Note GPL-compatible doesn’t mean that we’re distributing Python under the
>> GPL. All Python licenses, unlike the GPL, let you distribute a modified
>> version without making your changes open source. The GPL-compatible
>> licenses make it possible to combine Python with other software that is
>> released under the GPL; the others don’t.
>>
>> Nick
>> ​
>>
>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers  wrote:
>>
>>> i do not think so.
>>>
>>> does the python 2.7 need to be installed on all slaves? if so, we do not
>>> have direct access to those.
>>>
>>> also, spark is easy for us to ship with our software since its apache 2
>>> licensed, and it only needs to be present on the machine that launches the
>>> app (thanks to yarn).
>>> even if python 2.7 was needed only on this one machine that launches the
>>> app we can not ship it with our software because its gpl licensed, so the
>>> client would have to download it and install it themselves, and this would
>>> mean its an independent install which has to be audited and approved and
>>> now you are in for a lot of fun. basically it will never happen.
>>>
>>>
>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
>>> wrote:
>>>
 If users are able to install Spark 2.0 on their RHEL clusters, then I
 imagine that they're also capable of installing a standalone Python
 alongside that Spark version (without changing Python systemwide). For
 instance, Anaconda/Miniconda make it really easy to install Python
 2.7.x/3.x without impacting / changing the system Python and doesn't
 require any special permissions to install (you don't need root / sudo
 access). Does this address the Python versioning concerns for RHEL users?

 On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers 
 wrote:

> yeah, the practical concern is that we have no control over java or
> python version on large company clusters. our current reality for the vast
> majority of them is java 7 and python 2.6, no matter how outdated that is.
>
> i dont like it either, but i cannot change it.
>
> we currently don't use pyspark so i have no stake in this, but if we
> did i can assure you we would not upgrade to spark 2.x if python 2.6 was
> dropped. no point in developing something that doesnt run for majority of
> customers.
>
> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> As I pointed out in my earlier email, RHEL will support Python 2.6
>> until 2020. So I'm assuming these large companies will have the option of
>> riding out Python 2.6 until then.
>>
>> Are we seriously saying that Spark should likewise support Python 2.6
>> for the next several years? Even though the core Python devs stopped
>> supporting it in 2013?
>>
>> If that's not what we're suggesting, then when, roughly, can we drop
>> support? What are the criteria?
>>
>> I understand the practical concern here. If companies are stuck using
>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>> concern against the maintenance burden on this project, I would say that
>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position 
>> to
>> take. There are many tiny annoyances one has to put up with to support 
>> 2.6.
>>
>> I suppose if our main PySpark contributors are fine putting up with
>> those annoyances, then maybe we don't need to drop support just yet...
>>
>> Nick
>> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente <
>> ju...@esbet.es> wrote:
>>
>>> Unfortunately, Koert is right.
>>>
>>> I've been in a couple of projects using Spark (banking industry)
>>> where CentOS + Python 2.6 is the toolbox available.
>>>
>>> That said, I believe it should not be a concern for Spark. Python
>>> 2.6 is old and busted, which is totally opposite to the Spark philosophy
>>> IMO.
>>>
>>>
>>> El 5 ene 2016, a las 20:07, Koert Kuipers 
>>> escribió:
>>>
>>> rhel/centos 6 ships with python 2.6, doesnt it?
>>>
>>> if so, i still know plenty of large companies where python 2.6 is
>>> the only option. asking them for 

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
Yep, the driver and executors need to have compatible Python versions. I
think that there are some bytecode-level incompatibilities between 2.6 and
2.7 which would impact the deserialization of Python closures, so I think
you need to be running the same 2.x version for all communicating Spark
processes. Note that you _can_ use a Python 2.7 `ipython` executable on the
driver while continuing to use a vanilla `python` executable on the
executors (we have environment variables which allow you to control these
separately).
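
Concretely, that separation is controlled with environment variables along
these lines (a sketch; set them in spark-env.sh or the launching shell, and
the interpreter names are illustrative):

export PYSPARK_PYTHON=python            # interpreter used on the executors
export PYSPARK_DRIVER_PYTHON=ipython    # interpreter used only on the driver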

On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas  wrote:

> I think all the slaves need the same (or a compatible) version of Python
> installed since they run Python code in PySpark jobs natively.
>
> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers  wrote:
>
>> interesting i didnt know that!
>>
>> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> even if python 2.7 was needed only on this one machine that launches the
>>> app we can not ship it with our software because its gpl licensed
>>>
>>> Not to nitpick, but maybe this is important. The Python license is 
>>> GPL-compatible
>>> but not GPL :
>>>
>>> Note GPL-compatible doesn’t mean that we’re distributing Python under
>>> the GPL. All Python licenses, unlike the GPL, let you distribute a modified
>>> version without making your changes open source. The GPL-compatible
>>> licenses make it possible to combine Python with other software that is
>>> released under the GPL; the others don’t.
>>>
>>> Nick
>>> ​
>>>
>>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers  wrote:
>>>
 i do not think so.

 does the python 2.7 need to be installed on all slaves? if so, we do
 not have direct access to those.

 also, spark is easy for us to ship with our software since its apache 2
 licensed, and it only needs to be present on the machine that launches the
 app (thanks to yarn).
 even if python 2.7 was needed only on this one machine that launches
 the app we can not ship it with our software because its gpl licensed, so
 the client would have to download it and install it themselves, and this
 would mean its an independent install which has to be audited and approved
 and now you are in for a lot of fun. basically it will never happen.


 On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
 wrote:

> If users are able to install Spark 2.0 on their RHEL clusters, then I
> imagine that they're also capable of installing a standalone Python
> alongside that Spark version (without changing Python systemwide). For
> instance, Anaconda/Miniconda make it really easy to install Python
> 2.7.x/3.x without impacting / changing the system Python and doesn't
> require any special permissions to install (you don't need root / sudo
> access). Does this address the Python versioning concerns for RHEL users?
>
> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers 
> wrote:
>
>> yeah, the practical concern is that we have no control over java or
>> python version on large company clusters. our current reality for the 
>> vast
>> majority of them is java 7 and python 2.6, no matter how outdated that 
>> is.
>>
>> i dont like it either, but i cannot change it.
>>
>> we currently don't use pyspark so i have no stake in this, but if we
>> did i can assure you we would not upgrade to spark 2.x if python 2.6 was
>> dropped. no point in developing something that doesnt run for majority of
>> customers.
>>
>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> As I pointed out in my earlier email, RHEL will support Python 2.6
>>> until 2020. So I'm assuming these large companies will have the option 
>>> of
>>> riding out Python 2.6 until then.
>>>
>>> Are we seriously saying that Spark should likewise support Python
>>> 2.6 for the next several years? Even though the core Python devs stopped
>>> supporting it in 2013?
>>>
>>> If that's not what we're suggesting, then when, roughly, can we drop
>>> support? What are the criteria?
>>>
>>> I understand the practical concern here. If companies are stuck
>>> using 2.6, it doesn't matter to them that it is deprecated. But 
>>> balancing
>>> that concern against the maintenance burden on this project, I would say
>>> that "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable
>>> position to take. There are many tiny annoyances one has to put up with 
>>> to
>>> support 2.6.
>>>
>>> I suppose if our main PySpark contributors are fine putting up with
>>> those annoyances, then maybe we don't need to drop support just yet...
>>>

Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Michael Armbrust
This would also be possible with an Aggregator in Spark 1.6:
https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html
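
For the plain DataFrame route in 1.6, a collect_list-based sketch is another
option (note: in 1.6 collect_list is backed by the Hive UDAF, so it may
require a HiveContext):

import org.apache.spark.sql.functions.collect_list

// one row per id, with the names gathered into an array column
val grouped = df.groupBy("id").agg(collect_list("name").as("names"))
grouped.show()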

On Tue, Jan 5, 2016 at 2:59 PM, Ted Yu  wrote:

> Something like the following:
>
> val zeroValue = collection.mutable.Set[String]()
>
> val aggregated = data.aggregateByKey(zeroValue)((set, v) => set += v,
> (setOne, setTwo) => setOne ++= setTwo)
>
> On Tue, Jan 5, 2016 at 2:46 PM, Gavin Yue  wrote:
>
>> Hey,
>>
>> For example, a table df with two columns
>> id  name
>> 1   abc
>> 1   bdf
>> 2   ab
>> 2   cd
>>
>> I want to group by the id and concat the string into array of string.
>> like this
>>
>> id
>> 1 [abc,bdf]
>> 2 [ab, cd]
>>
>> How could I achieve this in dataframe?  I stuck on df.groupBy("id"). ???
>>
>> Thanks
>>
>>
>


Re: sparkR ORC support.

2016-01-05 Thread Deepak Sharma
Hi Sandeep
I am not sure if ORC files can be read directly in R.
But there is a workaround: first create a Hive table on top of the ORC files
and then access the Hive table from R.
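
A rough sketch of that workaround (table name, columns and path are
illustrative):

-- in Hive, or via sql() from SparkR once a HiveContext exists:
CREATE EXTERNAL TABLE orc_test (id INT, name STRING)
STORED AS ORC
LOCATION '/path/to/orc/files';

# then, from SparkR:
hivecontext <- sparkRHive.init(sc)
results <- sql(hivecontext, "SELECT id FROM orc_test")
head(results)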

Thanks
Deepak

On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana 
wrote:

> Hello
>
> I need to read an ORC files in hdfs in R using spark. I am not able to
> find a package to do that.
>
> Can anyone help with documentation or example for this purpose?
>
> --
> Architect
> Infoworks.io
> http://Infoworks.io
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: sparkR ORC support.

2016-01-05 Thread Sandeep Khurana
Thanks Deepak.

I tried this as well. I created a hivecontext with "hivecontext <<-
sparkRHive.init(sc)".

When I tried to read the hive table through it,

results <- sql(hivecontext, "FROM test SELECT id")

I get below error,

Error in callJMethod(sqlContext, "sql", sqlQuery) :
  Invalid jobj 2. If SparkR was restarted, Spark operations need to be
re-executed.


Not sure what is causing this. Any leads or ideas? I am using RStudio.



On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma  wrote:

> Hi Sandeep
> I am not sure if ORC files can be read directly in R.
> But there is a workaround: first create a Hive table on top of the ORC files
> and then access the Hive table from R.
>
> Thanks
> Deepak
>
> On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana 
> wrote:
>
>> Hello
>>
>> I need to read an ORC files in hdfs in R using spark. I am not able to
>> find a package to do that.
>>
>> Can anyone help with documentation or example for this purpose?
>>
>> --
>> Architect
>> Infoworks.io
>> http://Infoworks.io
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>



-- 
Architect
Infoworks.io
http://Infoworks.io


  1   2   >