Not able to receive data in spark from rsyslog

2015-12-03 Thread masoom alam
I am getting an error and am not able to receive data in my Spark Streaming
application from rsyslog. Please help with any pointers.

java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at java.net.Socket.connect(Socket.java:528)
at java.net.Socket.<init>(Socket.java:425)
at java.net.Socket.<init>(Socket.java:208)
at
org.apache.spark.streaming.dstream.SocketReceiver.receive(SocketInputDStream.scala:73)
at
org.apache.spark.streaming.dstream.SocketReceiver$$anon$2.run(SocketInputDStream.scala:59)

15/12/04 02:21:29 INFO ReceiverSupervisorImpl: Stopped receiver 0

However, data served from nc -lk 999 is received perfectly. Any
clue?

Thanks
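
For reference, a minimal sketch of the receiving side, assuming the application
uses socketTextStream (the host, port and batch interval below are placeholders,
not taken from the thread). A "Connection refused" at SocketReceiver.receive means
nothing was listening on that host:port when the receiver tried to connect, so the
listener (nc -lk <port>, or rsyslog forwarding over TCP) has to be up and reachable
from the executor before the streaming job starts:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SyslogSocketCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SyslogSocketCheck")
    val ssc = new StreamingContext(conf, Seconds(10))
    // host/port must match the listener the syslog data is sent to (e.g. nc -lk 9999)
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()   // just proves that data is arriving
    ssc.start()
    ssc.awaitTermination()
  }
}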


Re: How to test https://issues.apache.org/jira/browse/SPARK-10648 fix

2015-12-03 Thread Madabhattula Rajesh Kumar
Hi JB and Ted,

Thank you very much for the steps

Regards,
Rajesh

On Thu, Dec 3, 2015 at 8:16 PM, Ted Yu  wrote:

> See this thread for Spark 1.6.0 RC1
>
>
> http://search-hadoop.com/m/q3RTtKdUViYHH1b1=+VOTE+Release+Apache+Spark+1+6+0+RC1+
>
> Cheers
>
> On Thu, Dec 3, 2015 at 12:39 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi Team,
>>
>> Looks like this issue is fixed in the 1.6 release. How can I test this fix? Is
>> any jar available, so that I can add that jar as a dependency and test this
>> fix? (Or) Is there any other way I can test this fix in the 1.15.2 code base?
>>
>> Could you please let me know the steps. Thank you for your support
>>
>> Regards,
>> Rajesh
>>
>
>


Re: Python API Documentation Mismatch

2015-12-03 Thread Yanbo Liang
Hi Roberto,

There are two ALS implementations available: ml.recommendation.ALS and
mllib.recommendation.ALS.
They have different usage and methods. I know it's confusing that Spark
provides two versions of the same algorithm. I strongly recommend using the
ALS algorithm in the ML package.

Yanbo

2015-12-04 1:31 GMT+08:00 Felix Cheung :

> Please open an issue in JIRA, thanks!
>
>
>
>
>
> On Thu, Dec 3, 2015 at 3:03 AM -0800, "Roberto Pagliari" <
> roberto.pagli...@asos.com> wrote:
>
> Hello,
> I believe there is a mismatch between the API documentation (1.5.2) and
> the software currently available.
>
> Not all functions mentioned here
>
> http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.recommendation
>
> are, in fact available. For example, the code below from the tutorial works
>
> # Build the recommendation model using Alternating Least Squares
> rank = 10
> numIterations = 10
> model = ALS.train(ratings, rank, numIterations)
>
>
> While the alternative shown in the API documentation will not (it will
> complain that ALS takes no arguments; also, inspecting the module with
> Python utilities, I could not find several methods mentioned in the API docs)
>
> >>> df = sqlContext.createDataFrame(
> ...     [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
> ...     ["user", "item", "rating"])
> >>> als = ALS(rank=10, maxIter=5)
> >>> model = als.fit(df)
>
>
>
> Thank you,
>
>


Re: Re: spark master - run-tests error

2015-12-03 Thread wei....@kaiyuandao.com
it works like a charm for me. thanks for the quick workaround



 
From: Ted Yu
Date: 2015-12-04 10:45
To: wei@kaiyuandao.com
CC: user
Subject: Re: spark master - run-tests error
From dev/run-tests.py :

def identify_changed_files_from_git_commits(patch_sha, target_branch=None, 
target_ref=None):
"""
Given a git commit and target ref, use the set of files changed in the diff 
in order to
determine which modules' tests should be run.

Looks like the script needs a git commit to determine which modules' test suites
should be run.

I use the following command to run tests:

mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4 -Dhadoop.version=2.7.0 
package

FYI

On Thu, Dec 3, 2015 at 6:09 PM, wei@kaiyuandao.com  
wrote:

Hi, does anyone know why I get the following error when running
tests after a successful full build? Thanks

[root@sandbox spark_git]# dev/run-tests
**
File "./dev/run-tests.py", line 68, in 
__main__.identify_changed_files_from_git_commits
Failed example:
[x.name for x in determine_modules_for_files( 
identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
Exception raised:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/doctest.py", line 1253, in __run
compileflags, 1) in test.globs
  File "", 
line 1, in 
[x.name for x in determine_modules_for_files( 
identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
  File "./dev/run-tests.py", line 84, in 
identify_changed_files_from_git_commits
raw_output = subprocess.check_output(['git', 'diff', '--name-only', 
patch_sha, diff_target],
AttributeError: 'module' object has no attribute 'check_output'
**
File "./dev/run-tests.py", line 70, in 
__main__.identify_changed_files_from_git_commits
Failed example:
'root' in [x.name for x in determine_modules_for_files(  
identify_changed_files_from_git_commits("50a0496a43", target_ref="6765ef9"))]
Exception raised:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/doctest.py", line 1253, in __run
compileflags, 1) in test.globs
  File "", 
line 1, in 
'root' in [x.name for x in determine_modules_for_files(  
identify_changed_files_from_git_commits("50a0496a43", target_ref="6765ef9"))]




Re: newbie best practices: is spark-ec2 intended to be used to manage long-lasting infrastructure ?

2015-12-03 Thread Divya Gehlot
Hello,
I have the same questions in mind.
What are the advantages of using EC2 compared to normal servers for
Spark and other big data product development?
Hope to get inputs from the community.

Thanks,
Divya
On Dec 4, 2015 6:05 AM, "Andy Davidson" 
wrote:

> About 2 months ago I used spark-ec2 to set up a small cluster. The cluster
> runs a spark streaming app 7x24 and stores the data to hdfs. I also need to
> run some batch analytics on the data.
>
> Now that I have a little more experience, I wonder whether this was a good way
> to set up the cluster, given the following issues:
>
>1. I have not been able to find explicit directions for upgrading the
>spark version
>   1.
>   
> http://search-hadoop.com/m/q3RTt7E0f92v0tKh2=Re+Upgrading+Spark+in+EC2+clusters
>2. I am not sure where the data is physically be stored. I think I may
>accidentally loose all my data
>    3. spark-ec2 makes it easy to launch a cluster with as many machines
>    as you like; however, it's not clear how I would add slaves to an existing
>    installation
>
>
> Our Java streaming app we call rdd.saveAsTextFile(“hdfs://path”);
>
> ephemeral-hdfs/conf/hdfs-site.xml:
>
>   
>
> dfs.data.dir
>
> /mnt/ephemeral-hdfs/data,/mnt2/ephemeral-hdfs/data
>
>   
>
>
> persistent-hdfs/conf/hdfs-site.xml
>
>
> $ mount
>
> /dev/xvdb on /mnt type ext3 (rw,nodiratime)
>
> /dev/xvdf on /mnt2 type ext3 (rw,nodiratime)
>
>
> http://spark.apache.org/docs/latest/ec2-scripts.html
>
> *"*The spark-ec2 script also supports pausing a cluster. In this case,
> the VMs are stopped but not terminated, so they *lose all data on
> ephemeral disks* but keep the data in their root partitions and their
> persistent-pdfs.”
>
>
> Initially I thought using HDFS was a good idea. spark-ec2 makes HDFS easy
> to use. I incorrectly thought Spark somehow knew how HDFS partitioned my
> data.
>
> I think many people are using Amazon S3. I do not have any direct
> experience with S3. My concern would be that the data is not physically
> stored close to my slaves, i.e. high communication costs.
>
> Any suggestions would be greatly appreciated
>
> Andy
>
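
A hedged sketch of the S3 option mentioned above, assuming the s3a connector and
credentials are available to the cluster (the bucket name and configuration keys
below are illustrative; s3a generally needs a recent Hadoop, 2.6+/2.7, on the classpath):

// Make the AWS credentials visible to the s3a filesystem (one of several ways)
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// Write a batch of the stream to S3 instead of ephemeral HDFS
rdd.saveAsTextFile("s3a://my-bucket/stream-output/batch-0001")

Whether the extra latency to S3 matters compared to local HDFS is something to
measure rather than assume.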


Re: Re: spark sql cli query results written to file ?

2015-12-03 Thread fightf...@163.com
Well, sorry for the late response, and thanks a lot for pointing out the clue.



fightf...@163.com
 
From: Akhil Das
Date: 2015-12-03 14:50
To: Sahil Sareen
CC: fightf...@163.com; user
Subject: Re: spark sql cli query results written to file ?
Oops 3 mins late. :)

Thanks
Best Regards

On Thu, Dec 3, 2015 at 11:49 AM, Sahil Sareen  wrote:
Yeah, Thats the example from the link I just posted.

-Sahil

On Thu, Dec 3, 2015 at 11:41 AM, Akhil Das  wrote:
Something like this?

val df = sqlContext.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

It will save the name, favorite_color columns to a parquet file. You can read 
more information over here 
http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes



Thanks
Best Regards

On Thu, Dec 3, 2015 at 11:35 AM, fightf...@163.com  wrote:
HI,
How can I save the results of queries run in the Spark SQL CLI and write the
results to some local file?
Is there any available command?

Thanks,
Sun.



fightf...@163.com
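
For the CLI itself (as opposed to the DataFrame API shown above), one option is to
run the query non-interactively and redirect its output. This is a hedged
suggestion, assuming the CLI's Hive-style -e flag; the query and path below are
placeholders:

bin/spark-sql -e "SELECT col1, col2 FROM mytable" > /tmp/results.txt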





How to test https://issues.apache.org/jira/browse/SPARK-10648 fix

2015-12-03 Thread Madabhattula Rajesh Kumar
Hi Team,

Looks like this issue is fixed in the 1.6 release. How can I test this fix? Is any
jar available, so that I can add that jar as a dependency and test this fix?
(Or) Is there any other way I can test this fix in the 1.15.2 code base?

Could you please let me know the steps. Thank you for your support

Regards,
Rajesh


Column Aliases are Ignored in callUDF while using struct()

2015-12-03 Thread Sachin Aggarwal
Hi All,

Need help, guys; I need a workaround for this situation.

*case where this works:*

val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"),
("Rishabh", "2"))).toDF("myText", "id")

TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show


steps to reproduce error case:
1) create a file named a.json containing the following text

{ "myText": "Sachin Aggarwal", "id": "1"}
{ "myText": "Rishabh", "id": "2"}

2) define a simple UDF
def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}

3) register the udf
 sqlContext.udf.register("mydef" ,mydef _)

4) read the input file
val TestDoc2=sqlContext.read.json("/tmp/a.json")

5) make a call to UDF
TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show

ERROR received:
java.lang.IllegalArgumentException: Field "Text" does not exist.
 at
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
 at
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
 at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
 at scala.collection.AbstractMap.getOrElse(Map.scala:58)
 at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233)
 at
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212)
 at org.apache.spark.sql.Row$class.getAs(Row.scala:325)
 at
org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
 at
$line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107)
 at
$line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
 at
$line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
 at
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
 at
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
 at
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
 at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
Source)
 at
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
 at
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
 at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
 at org.apache.spark.scheduler.Task.run(Task.scala:88)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1177)
 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
 at java.lang.Thread.run(Thread.java:857)

-- 

Thanks & Regards

Sachin Aggarwal
7760502772
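
A possible workaround (an untested sketch; it assumes the struct's field names are
what gets lost, so it avoids looking fields up by name inside the UDF and reads
them by position instead):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{callUDF, struct}
// in spark-shell $"..." works directly; otherwise: import sqlContext.implicits._

// position 0 is the column aliased as "Text", position 1 the one aliased as "label"
def mydefByIndex(r: Row) = r.getString(0)
sqlContext.udf.register("mydefByIndex", mydefByIndex _)

TestDoc2.select(
  callUDF("mydefByIndex", struct($"myText".as("Text"), $"id".as("label"))).as("col1")
).show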


Re: Column Aliases are Ignored in callUDF while using struct()

2015-12-03 Thread Sahil Sareen
Attaching the JIRA as well for completeness:
https://issues.apache.org/jira/browse/SPARK-12117

On Thu, Dec 3, 2015 at 4:13 PM, Sachin Aggarwal 
wrote:

>
> Hi All,
>
> need help guys, I need a work around for this situation
>
> *case where this works:*
>
> val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"),
> ("Rishabh", "2"))).toDF("myText", "id")
>
> TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
>
>
> steps to reproduce error case:
> 1) create a file copy following text--filename(a.json)
>
> { "myText": "Sachin Aggarwal", "id": "1"}
> { "myText": "Rishabh", "id": "2"}
>
> 2) define a simple UDF
> def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}
>
> 3) register the udf
>  sqlContext.udf.register("mydef" ,mydef _)
>
> 4) read the input file
> val TestDoc2=sqlContext.read.json("/tmp/a.json")
>
> 5) make a call to UDF
>
> TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
>
> ERROR received:
> java.lang.IllegalArgumentException: Field "Text" does not exist.
>  at
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>  at scala.collection.AbstractMap.getOrElse(Map.scala:58)
>  at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233)
>  at
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212)
>  at org.apache.spark.sql.Row$class.getAs(Row.scala:325)
>  at
> org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
>  at
> $line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107)
>  at
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
>  at
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
>  at
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
>  at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
> Source)
>  at
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
>  at
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>  at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>  at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>  at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>  at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>  at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>  at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>  at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>  at org.apache.spark.scheduler.Task.run(Task.scala:88)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>  at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1177)
>  at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>  at java.lang.Thread.run(Thread.java:857)
>
> --
>
> Thanks & Regards
>
> Sachin Aggarwal
> 7760502772
>


How and where to update release notes for spark rel 1.6?

2015-12-03 Thread RaviShankar KS
Hi,

How and where to update release notes for spark rel 1.6?
pls help.

There are a few methods with changed params, and a few deprecated ones that
need to be documented.

Thanks
Ravi


Re: Spark Streaming from S3

2015-12-03 Thread Steve Loughran

On 3 Dec 2015, at 00:42, Michele Freschi 
> wrote:

Hi all,

I have an app streaming from S3 (textFileStream) and recently I've observed
increasing delays and a long time to list files:

INFO dstream.FileInputDStream: Finding new files took 394160 ms
...
INFO scheduler.JobScheduler: Total delay: 404.796 s for time 144910020 ms 
(execution: 10.154 s)

At this time I have about 13K files under the key prefix that I'm monitoring - 
hadoop takes about 6 minutes to list all the files while aws cli takes only 
seconds.
My understanding is that this is a current limitation of hadoop but I wanted to 
confirm it in case it's a misconfiguration on my part.

Not a known issue.

Usual questions: which Hadoop version, and are you using the s3n or s3a connector?
The latter does use the AWS SDK, but it's only been stable enough to use in
Hadoop 2.7.


Some alternatives I'm considering:
1. copy old files to a different key prefix
2. use one of the available SQS receivers 
(https://github.com/imapi/spark-sqs-receiver ?)
3. implement the s3 listing outside of spark and use socketTextStream, but I 
couldn't find if it's reliable or not
4. create a custom S3 receiver using the AWS SDK (even if it doesn't look like it's
possible to use it from PySpark)

Has anyone experienced the same issue and found a better way to solve it ?

Thanks,
Michele




Re: Building spark 1.3 from source code to work with Hive 1.2.1

2015-12-03 Thread zhangjp
I have encountered the same issues. Before I changed the Spark version, I set
up my environment as follows:
 spark 1.5.2
 hadoop 2.6.2
 hive 1.2.1
But no luck, it does not work well; even when I run the assembly Hive in Spark with
JDBC mode there are also some problems.
Then I changed the Spark version to 1.3.1 and rebuilt. I just ran an example and
there is an issue that seems to be a protobuf version conflict; I will rebuild and
try again.

Exception in thread "main" java.lang.VerifyError: class
org.apache.hadoop.yarn.proto.YarnProtos$PriorityProto overrides final method
getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  
  
 
 
 

 -- Original --
  From:  "Mich Talebzadeh";;
 Date:  Thu, Dec 3, 2015 06:28 PM
 To:  "user"; "user"; 
 
 Subject:  Building spark 1.3 from source code to work with Hive 1.2.1

 

  
Hi,
 
 
 
I have seen mails that state that the user has managed to build spark 1.3 to 
work with Hive. I tried Spark 1.5.2 but no luck
 
 
 
I downloaded spark source 1.3 source code spark-1.3.0.tar and built it as 
follows
 
 
 
./make-distribution.sh --name "hadoop2-without-hive" --tgz 
"-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
 
 
 
This successfully completed and created the tarred zip file. I then created 
spark 1.3 tree from this zipped file. $SPARK_HOME is /usr/lib/spark
 
 
 
Other steps that I performed:
 
 
 
1.In $HIVE_HOME/lib , I copied  spark-assembly-1.3.0-hadoop2.4.0.jar  to 
this directory
 
2.  In $SPARK_HOME/conf I created a syblink to /usr/lib/hive/conf/hive-site.xml
 
 
 
Then I tried to start spark master node
 
 
 
/usr/lib/spark/sbin/start-master.sh
 
 
 
I get the following error:
 
 
 
 
 
cat 
/usr/lib/spark/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out
 
Spark Command: /usr/java/latest/bin/java -cp 
:/usr/lib/spark/sbin/../conf:/usr/lib/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/home/hduser/hadoop-2.6.0/etc/hadoop
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.master.Master --ip rhes564 --port 7077 --webui-port 8080
 

 
 
 
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
 
at java.lang.Class.getDeclaredMethods0(Native Method)
 
at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)
 
at java.lang.Class.getMethod0(Class.java:2764)
 
at java.lang.Class.getMethod(Class.java:1653)
 
at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
 
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 
Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
 
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 
at java.security.AccessController.doPrivileged(Native Method)
 
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 
 
 
I also notice that in /usr/lib/spark/lib, I only have the following jar files
 
 
 
-rw-r--r-- 1 hduser hadoop 98795479 Dec  3 09:03 
spark-examples-1.3.0-hadoop2.4.0.jar
 
-rw-r--r-- 1 hduser hadoop 98187168 Dec  3 09:03 
spark-assembly-1.3.0-hadoop2.4.0.jar
 
-rw-r--r-- 1 hduser hadoop  4136760 Dec  3 09:03 spark-1.3.0-yarn-shuffle.jar
 
 
 
Whereas in the pre-built download --> /usr/lib/spark-1.3.0-bin-hadoop2.4, there
are additional JAR files
 
 
 
-rw-rw-r-- 1 hduser hadoop   1890075 Mar  6  2015 datanucleus-core-3.2.10.jar
 
-rw-rw-r-- 1 hduser hadoop 112446389 Mar  6  2015 
spark-examples-1.3.0-hadoop2.4.0.jar
 
-rw-rw-r-- 1 hduser hadoop 159319006 Mar  6  2015 
spark-assembly-1.3.0-hadoop2.4.0.jar
 
-rw-rw-r-- 1 hduser hadoop   4136744 Mar  6  2015 spark-1.3.0-yarn-shuffle.jar
 
-rw-rw-r-- 1 hduser hadoop   1809447 Mar  6  2015 datanucleus-rdbms-3.2.9.jar
 
-rw-rw-r-- 1 hduser hadoop339666 Mar  6  2015 datanucleus-api-jdo-3.2.6.jar
 
 
 
Any ideas what is missing? I am sure someone has sorted this one out before.
 
 
 
 
 
Thanks,
 
 
 
Mich
 
 
 
 
 
 
 
NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that 

Python API Documentation Mismatch

2015-12-03 Thread Roberto Pagliari
Hello,
I believe there is a mismatch between the API documentation (1.5.2) and the 
software currently available.

Not all functions mentioned here
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.recommendation

are, in fact available. For example, the code below from the tutorial works

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)

While the alternative shown in the API documentation will not (it will complain
that ALS takes no arguments; also, inspecting the module with Python
utilities, I could not find several methods mentioned in the API docs)

>>> df = sqlContext.createDataFrame(
... [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 
2, 5.0)],
... ["user", "item", "rating"])
>>> als = ALS(rank=10, maxIter=5)
>>> model = als.fit(df)


Thank you,



Re: Checkpointing not removing shuffle files from local disk

2015-12-03 Thread Ewan Higgs
Hi all,
We are running a class with PySpark notebooks for data analysis. Some of
the notebooks are fairly long and have a lot of operations. Through the
course of a notebook, the shuffle storage expands considerably and
often exceeds quota (e.g. 1.5GB input expands to 24GB in shuffle
files). Closing and reopening the notebook doesn't clean out the
shuffle directory.

FWIW, the shuffle memory really explodes when we use ALS.

There is a ticket to make sure this is well documented, but there are
also suggestions that the problem should have gone away with Spark 1.0:

https://issues.apache.org/jira/browse/SPARK-5836

Yours,
Ewan

On Tue, 2015-09-29 at 01:18 -0700, ramibatal wrote:
> Hi all,
> 
> I am applying MLlib LDA for topic modelling. I am setting up the the
> lda
> parameter as follow:
> 
> lda.setOptimizer(optimizer)
>   .setK(params.k)
>   .setMaxIterations(params.maxIterations)
>   .setDocConcentration(params.docConcentration)
>   .setTopicConcentration(params.topicConcentration)
>   .setCheckpointInterval(params.checkpointInterval)
>   if (params.checkpointDir.nonEmpty) {
>   sc.setCheckpointDir(params.checkpointDir.get)
>  }
> 
> 
> I am running the LDA algorithm on my local MacOS machine, on a corpus
> of
> 800,000 english text documents (total size 9GB), and my machine has 8
> cores
> with 16GB or RAM and 500GB or hard disk.
> 
> Here are my Spark configurations:
> 
> val conf = new
> SparkConf().setMaster("local[6]").setAppName("LDAExample")
> val sc = new SparkContext(conf)
> 
> 
> When calling the LDA with a large number of iteration (100) (i.e. by
> calling
> val ldaModel = lda.run(corpus)), the algorithm start to create
> shuffle files
> on my disk at at point that it fills it up till there is space left.
> 
> I am using spark-submit to run my program as follow:
> 
> spark-submit --driver-memory 14G --class
> com.heystaks.spark.ml.topicmodelling.LDAExample
> ./target/scala-2.10/lda-assembly-1.0.jar path/to/copurs/file --k 100
> --maxIterations 100 --checkpointDir /Users/ramialbatal/checkpoints
> --checkpointInterval 1
> 
> 
> Where 'K' is the number of topics to extract, when the number of
> iterations
> and topics are small everything is fine, but when there is large
> iteration
> number like 100, no matter what is the value of --checkpointInterval
> the
> phenomenon is the same: disk will fill up after about 25 iteration.
> 
> Everything seems to run correctly and the checkpoints files are
> created on
> my disk but the shuffle files are not removed at all.
> 
> I am using Spark and MLlib 1.5.0, and my machine is Mac Yosemite
> 10.10.5.
> 
> Any help is highly appreciated. Thanks
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n
> 3.nabble.com/Checkpointing-not-removing-shuffle-files-from-local-
> disk-tp24857.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: spark1.4.1 extremely slow for take(1) or head() or first() or show

2015-12-03 Thread Mich Talebzadeh
Can you try running it directly on Hive to see the timing, or maybe through
spark-sql?

 

Spark does what Hive does, that is, process large sets of data, but it
attempts to do the intermediate iterations in memory if it can (i.e. if
there is enough memory available to keep the data set in memory); otherwise
it will have to use disk space. So it boils down to how much memory you
have.

 

HTH

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.

 

From: hxw黄祥为 [mailto:huang...@ctrip.com] 
Sent: 03 December 2015 10:29
To: user@spark.apache.org
Subject: spark1.4.1 extremely slow for take(1) or head() or first() or show

 

Dear All,

 

I have a hive table with 100 million data and I just ran some very simple
operations on this dataset like:

 

  val df = sqlContext.sql("select * from user ").toDF

  df.cache

  df.registerTempTable("tb")

  val b=sqlContext.sql("select  'uid',max(length(uid)),count(distinct(uid)),
count(uid),sum(case when uid is null then 0 else 1 end),sum(case when uid is
null then 1 else 0 end),sum(case when uid is null then 1 else 0
end)/count(uid) from tb")

  b.show  //the result just one line but this step is extremely slow

 

Is this expected? Why show is so slow for dataframe? Is it a bug in the
optimizer? or I did something wrong? 

 

 

Best Regards,

tylor



Re: Can Spark Execute Hive Update/Delete operations

2015-12-03 Thread 张炜
Hi all,
Sorry, the referenced link is not using a private/own branch of Hive. It's
using Hortonworks 2.3 and the Hive packaged in HDP 2.3, with a
standalone Spark cluster (1.5.2) installed.

But Hive on Spark cannot run.

Could anyone help on this? Thanks a lot!

Regards,
Sai

On Wed, Dec 2, 2015 at 9:58 PM Ted Yu  wrote:

> The referenced link seems to be w.r.t. Hive on Spark which is still in its
> own branch of Hive.
>
> FYI
>
> On Tue, Dec 1, 2015 at 11:23 PM, 张炜  wrote:
>
>> Hello Ted and all,
>> We are using Hive 1.2.1 and Spark 1.5.1
>> I also noticed that there are other users reporting this problem.
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-spark-on-hive-td25372.html#a25486
>> Thanks a lot for help!
>>
>> Regards,
>> Sai
>>
>> On Wed, Dec 2, 2015 at 11:11 AM Ted Yu  wrote:
>>
>>> Can you tell us the version of Spark and hive you use ?
>>>
>>> Thanks
>>>
>>> On Tue, Dec 1, 2015 at 7:08 PM, 张炜  wrote:
>>>
 Dear all,
 We have a requirement that needs to update delete records in hive.
 These operations are available in hive now.

 But when using hiveContext in Spark, it always pops up an "not
 supported" error.
 Is there anyway to support update/delete operations using spark?

 Regards,
 Sai

>>>
>>>
>


spark1.4.1 extremely slow for take(1) or head() or first() or show

2015-12-03 Thread hxw黄祥为
Dear All,



I have a Hive table with 100 million rows and I just ran some very simple
operations on this dataset, like:



  val df = sqlContext.sql("select * from user ").toDF
  df.cache
  df.registerTempTable("tb")
  val b=sqlContext.sql("select  
'uid',max(length(uid)),count(distinct(uid)),count(uid),sum(case when uid is 
null then 0 else 1 end),sum(case when uid is null then 1 else 0 end),sum(case 
when uid is null then 1 else 0 end)/count(uid) from tb")
  b.show  // the result is just one line, but this step is extremely slow

Is this expected? Why is show so slow for a DataFrame? Is it a bug in the
optimizer, or did I do something wrong?


Best Regards,
tylor
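
One thing worth ruling out first, as a hedged observation: df.cache is lazy, so the
first action on the cached DataFrame -- here b.show -- pays for scanning and caching
the whole 100-million-row table on top of the aggregation itself. Materializing the
cache separately makes it clearer where the time actually goes; a minimal sketch
(the query is trimmed for brevity):

val df = sqlContext.sql("select * from user")
df.cache()
df.count()                      // forces the full scan and populates the cache
df.registerTempTable("tb")

val b = sqlContext.sql("select max(length(uid)), count(distinct uid) from tb")
b.show()                        // now times (mostly) the aggregation over cached data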


Re: spark1.4.1 extremely slow for take(1) or head() or first() or show

2015-12-03 Thread Sahil Sareen
"select  'uid',max(length(uid)),count(distinct(uid)),count(uid),sum(case
when uid is null then 0 else 1 end),sum(case when uid is null then 1 else 0
end),sum(case when uid is null then 1 else 0 end)/count(uid) from tb"

Is this as is, or did you use a UDF here?

-Sahil

On Thu, Dec 3, 2015 at 4:06 PM, Mich Talebzadeh  wrote:

> Can you try running it directly on hive to see the timing or through
> spark-sql may be.
>
>
>
> Spark does what Hive does that is processing large sets of data, but it
> attempts to do the intermediate iterations in memory if it can (i.e. if
> there is enough memory available to keep the data set in memory), otherwise
> it will have to use disk space. So it boils down to how much memory you
> have.
>
>
>
> HTH
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* hxw黄祥为 [mailto:huang...@ctrip.com]
> *Sent:* 03 December 2015 10:29
> *To:* user@spark.apache.org
> *Subject:* spark1.4.1 extremely slow for take(1) or head() or first() or
> show
>
>
>
> Dear All,
>
>
>
> I have a hive table with 100 million data and I just ran some very simple
> operations on this dataset like:
>
>
>
>   val df = sqlContext.sql("select * from user ").toDF
>
>   df.cache
>
>   df.registerTempTable("tb")
>
>   val b=sqlContext.sql("select
> 'uid',max(length(uid)),count(distinct(uid)),count(uid),sum(case when uid is
> null then 0 else 1 end),sum(case when uid is null then 1 else 0
> end),sum(case when uid is null then 1 else 0 end)/count(uid) from tb")
>
>   b.show  //the result just one line but this step is extremely slow
>
>
>
> Is this expected? Why show is so slow for dataframe? Is it a bug in the
> optimizer? or I did something wrong?
>
>
>
>
>
> Best Regards,
>
> tylor
>


Re: LDA topic modeling and Spark

2015-12-03 Thread Robin East
What exactly is this probability distribution? For each word in your vocabulary 
it is the probability that a randomly drawn word from a topic is that word. 
Another way to visualise it is a 2-column vector where the 1st column is a word 
in your vocabulary and the 2nd column is the probability of that word 
appearing. All the values in the 2nd-column must be >= 0 and if you add up all 
the values they should sum to 1. That is the definition of a probability 
distribution. 

Clearly for the idea of topics to be at all useful you want different topics to 
exhibit different probability distributions i.e. some words to be more likely 
in 1 topic compared to another topic.

How does it actually infer words and topics? Probably a good idea to google for 
that one if you really want to understand the details - there are some great 
resources available.

How can I connect the output to the actual words in each topic? A typical way 
is to look at the top 5, 10 or 20 words in each topic and use those to infer 
something about what the topic represents.
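
To make that last point concrete, a minimal sketch using MLlib's describeTopics,
assuming vocab is an Array[String] that maps the term indices used when building
the count vectors back to words (the vocab array itself is an assumption; it is not
part of the linked example):

// ldaModel comes from the mllib-clustering example linked below
val topics = ldaModel.describeTopics(maxTermsPerTopic = 10)
topics.zipWithIndex.foreach { case ((termIndices, termWeights), topicId) =>
  println(s"Topic $topicId:")
  termIndices.zip(termWeights).foreach { case (term, weight) =>
    // weight is the probability of this word under this topic
    println(f"  ${vocab(term)}%-20s $weight%.4f")
  }
}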
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 






> On 3 Dec 2015, at 05:07, Nguyen, Tiffany T  wrote:
> 
> Hello,
> 
> I have been trying to understand the LDA topic modeling example provided 
> here: 
> https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda
>  
> .
>  In the example, they load word count vectors from a text file that contains 
> these word counts and then they output the topics, which is represented as 
> probability distributions over words. What exactly is this probability 
> distribution? How does it actually infer words and topics and how can I 
> connect the output to the actual words in each topic?
> 
> Thanks!



Re: How to test https://issues.apache.org/jira/browse/SPARK-10648 fix

2015-12-03 Thread Jean-Baptiste Onofré

Hi Rajesh,

you can check codebase and build yourself in order to test:

git clone https://git-wip-us.apache.org/repos/asf/spark
cd spark
mvn clean package -DskipTests

You will have bin, sbin and conf folders to try it.

Regards
JB

On 12/03/2015 09:39 AM, Madabhattula Rajesh Kumar wrote:

Hi Team,

Looks like this issue is fixed in the 1.6 release. How can I test this fix? Is
any jar available, so that I can add that jar as a dependency and test this
fix? (Or) Is there any other way I can test this fix in the 1.15.2 code base?

Could you please let me know the steps. Thank you for your support

Regards,
Rajesh


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Building spark 1.3 from source code to work with Hive 1.2.1

2015-12-03 Thread Mich Talebzadeh
Hi,

 

I have seen mails that state that the user has managed to build spark 1.3 to
work with Hive. I tried Spark 1.5.2 but no luck

 

I downloaded spark source 1.3 source code spark-1.3.0.tar and built it as
follows

 

./make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

 

This successfully completed and created the tarred zip file. I then created
spark 1.3 tree from this zipped file. $SPARK_HOME is /usr/lib/spark

 

Other steps that I performed:

 

1.In $HIVE_HOME/lib , I copied  spark-assembly-1.3.0-hadoop2.4.0.jar  to
this directory

2.  In $SPARK_HOME/conf I created a syblink to
/usr/lib/hive/conf/hive-site.xml

 

Then I tried to start spark master node

 

/usr/lib/spark/sbin/start-master.sh

 

I get the following error:

 

 

cat
/usr/lib/spark/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Mast
er-1-rhes564.out

Spark Command: /usr/java/latest/bin/java -cp
:/usr/lib/spark/sbin/../conf:/usr/lib/spark/lib/spark-assembly-1.3.0-hadoop2
.4.0.jar:/home/hduser/hadoop-2.6.0/etc/hadoop -XX:MaxPermSize=128m
-Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
org.apache.spark.deploy.master.Master --ip rhes564 --port 7077 --webui-port
8080



 

Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger

at java.lang.Class.getDeclaredMethods0(Native Method)

at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)

at java.lang.Class.getMethod0(Class.java:2764)

at java.lang.Class.getMethod(Class.java:1653)

at
sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)

at
sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)

Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger

at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

at java.net.URLClassLoader$1.run(URLClassLoader.java:355)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:354)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

 

I also notice that in /usr/lib/spark/lib, I only have the following jar
files

 

-rw-r--r-- 1 hduser hadoop 98795479 Dec  3 09:03
spark-examples-1.3.0-hadoop2.4.0.jar

-rw-r--r-- 1 hduser hadoop 98187168 Dec  3 09:03
spark-assembly-1.3.0-hadoop2.4.0.jar

-rw-r--r-- 1 hduser hadoop  4136760 Dec  3 09:03
spark-1.3.0-yarn-shuffle.jar

 

Whereas in the pre-built download --> /usr/lib/spark-1.3.0-bin-hadoop2.4,
there are additional JAR files

 

-rw-rw-r-- 1 hduser hadoop   1890075 Mar  6  2015
datanucleus-core-3.2.10.jar

-rw-rw-r-- 1 hduser hadoop 112446389 Mar  6  2015
spark-examples-1.3.0-hadoop2.4.0.jar

-rw-rw-r-- 1 hduser hadoop 159319006 Mar  6  2015
spark-assembly-1.3.0-hadoop2.4.0.jar

-rw-rw-r-- 1 hduser hadoop   4136744 Mar  6  2015
spark-1.3.0-yarn-shuffle.jar

-rw-rw-r-- 1 hduser hadoop   1809447 Mar  6  2015
datanucleus-rdbms-3.2.9.jar

-rw-rw-r-- 1 hduser hadoop339666 Mar  6  2015
datanucleus-api-jdo-3.2.6.jar

 

Any ideas what is missing? I am sure someone has sorted this one out
before.

 

 

Thanks,

 

Mich

 

 

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.
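
One hedged pointer: slf4j (and the rest of the Hadoop runtime jars) being absent is
the expected consequence of building with the hadoop-provided profile -- that
profile deliberately leaves the Hadoop dependencies (and their transitive slf4j) out
of the assembly and expects the host to supply them. If that is what is happening
here, pointing Spark at the local Hadoop installation in conf/spark-env.sh may be
enough, for example:

export SPARK_DIST_CLASSPATH=$(hadoop classpath)

(The datanucleus-*.jar files seen only in the pre-built package are also needed for
Hive metastore access and are normally produced by building with the -Phive and
-Phive-thriftserver profiles.)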

 



One Problem about Spark Dynamic Allocation

2015-12-03 Thread 谢廷稳
Hi all,
I ran Spark 1.4 with dynamic allocation enabled. While it was running, I could
see each executor's information, such as ID, address, shuffle read/write, logs,
etc. But once an executor was removed, the web page no longer displays that
executor; in the end, the app's information in the Spark History Server shows only
one executor, the driver. So I can't see the executors' information any
more.
Is this a bug? How can I fix it?

Thanks!
Weber


Re: How and where to update release notes for spark rel 1.6?

2015-12-03 Thread Jean-Baptiste Onofré

Hi Ravi,

Even if it's not perfect, you can take a look at the current
release notes on JIRA:


https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420=12333083

Regards
JB

On 12/03/2015 12:01 PM, RaviShankar KS wrote:

Hi,

How and where to update release notes for spark rel 1.6?
pls help.

There are a few methods with changed params, and a few deprecated ones
that need to be documented.

Thanks
Ravi


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Sparse Vector ArrayIndexOutOfBoundsException

2015-12-03 Thread nabegh
I'm trying to run an SVM classifier on unlabeled data. I followed this
to build the vectors and checked this.

Now when I make the call to predict, I receive the following error. Any
hints?

val v = featureRDD.map(f => Vectors.sparse(featureSet.length, f))
//length = 63
val predictions = model.predict(v)
println(s"predictions length  = ${predictions.collect.length}")


Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 1 in stage 121.0 failed 4 times, most recent failure:
Lost task 1.3 in stage 121.0 (TID 233, 10.1.1.63):
java.lang.ArrayIndexOutOfBoundsException: 62
at
breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$98.apply(SparseVectorOps.scala:297)
at
breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$98.apply(SparseVectorOps.scala:282)
at
breeze.linalg.operators.BinaryRegistry$class.apply(BinaryOp.scala:60)
at breeze.linalg.VectorOps$$anon$171.apply(Vector.scala:528)
at breeze.linalg.ImmutableNumericOps$class.dot(NumericOps.scala:98)
at breeze.linalg.DenseVector.dot(DenseVector.scala:50)
at
org.apache.spark.mllib.classification.SVMModel.predictPoint(SVM.scala:81)
at
org.apache.spark.mllib.regression.GeneralizedLinearModel$$anonfun$predict$1$$anonfun$apply$1.apply(GeneralizedLinearAlgorithm.scala:71)
at
org.apache.spark.mllib.regression.GeneralizedLinearModel$$anonfun$predict$1$$anonfun$apply$1.apply(GeneralizedLinearAlgorithm.scala:71)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:909)
at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:909)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
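
A hedged guess at the cause: an ArrayIndexOutOfBoundsException inside the
dense/sparse dot product usually means the vectors passed to predict have a
different dimension than the model's weight vector (or a sparse index >= the
declared size). A quick check before predicting:

println(s"model weights size  = ${model.weights.size}")   // dimension fixed at training time
println(s"prediction vec size = ${featureSet.length}")    // 63 in the snippet above
// The two must be equal, and every index in each sparse entry must be < featureSet.length.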




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Sparse-Vector-ArrayIndexOutOfBoundsException-tp25556.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to test https://issues.apache.org/jira/browse/SPARK-10648 fix

2015-12-03 Thread Ted Yu
See this thread for Spark 1.6.0 RC1

http://search-hadoop.com/m/q3RTtKdUViYHH1b1=+VOTE+Release+Apache+Spark+1+6+0+RC1+

Cheers

On Thu, Dec 3, 2015 at 12:39 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:

> Hi Team,
>
> Looks like this issue is fixed in the 1.6 release. How can I test this fix? Is
> any jar available, so that I can add that jar as a dependency and test this
> fix? (Or) Is there any other way I can test this fix in the 1.15.2 code base?
>
> Could you please let me know the steps. Thank you for your support
>
> Regards,
> Rajesh
>


Re: Multiplication on decimals in a dataframe query

2015-12-03 Thread Philip Dodds
I'll open up a JIRA for it; it appears to work when you use a literal
number but not when it is coming from the same dataframe.

Thanks!

P

On Thu, Dec 3, 2015 at 1:52 AM, Sahil Sareen  wrote:

> +1 looks like a bug
>
> I think referencing trades() twice in multiplication is broken,
>
> scala> trades.select(trades("quantity")*trades("quantity")).show
>
> +-+
> |(quantity * quantity)|
> +-+
> | null|
> | null|
>
> scala> sqlContext.sql("select price*price as PP from trades").show
>
> ++
> |  PP|
> ++
> |null|
> |null|
>
>
> -Sahil
>
> On Thu, Dec 3, 2015 at 12:02 PM, Akhil Das 
> wrote:
>
>> Not quiet sure whats happening, but its not an issue with multiplication
>> i guess as the following query worked for me:
>>
>> trades.select(trades("price")*9.5).show
>> +-+
>> |(price * 9.5)|
>> +-+
>> |199.5|
>> |228.0|
>> |190.0|
>> |199.5|
>> |190.0|
>> |256.5|
>> |218.5|
>> |275.5|
>> |218.5|
>> ..
>> ..
>>
>>
>> Could it be with the precision? ccing dev list, may be you can open up a
>> jira for this as it seems to be a bug.
>>
>> Thanks
>> Best Regards
>>
>> On Mon, Nov 30, 2015 at 12:41 AM, Philip Dodds 
>> wrote:
>>
>>> I hit a weird issue when I tried to multiply to decimals in a select
>>> (either in scala or as SQL), and Im assuming I must be missing the point.
>>>
>>> The issue is fairly easy to recreate with something like the following:
>>>
>>>
>>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>> import sqlContext.implicits._
>>> import org.apache.spark.sql.types.Decimal
>>>
>>> case class Trade(quantity: Decimal,price: Decimal)
>>>
>>> val data = Seq.fill(100) {
>>>   val price = Decimal(20+scala.util.Random.nextInt(10))
>>> val quantity = Decimal(20+scala.util.Random.nextInt(10))
>>>
>>>   Trade(quantity, price)
>>> }
>>>
>>> val trades = sc.parallelize(data).toDF()
>>> trades.registerTempTable("trades")
>>>
>>> trades.select(trades("price")*trades("quantity")).show
>>>
>>> sqlContext.sql("select
>>> price/quantity,price*quantity,price+quantity,price-quantity from
>>> trades").show
>>>
>>> The odd part is if you run it you will see that the addition/division
>>> and subtraction works but the multiplication returns a null.
>>>
>>> Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11)
>>>
>>> ie.
>>>
>>> +--+
>>>
>>> |(price * quantity)|
>>>
>>> +--+
>>>
>>> |  null|
>>>
>>> |  null|
>>>
>>> |  null|
>>>
>>> |  null|
>>>
>>> |  null|
>>>
>>> +--+
>>>
>>>
>>> +++++
>>>
>>> | _c0| _c1| _c2| _c3|
>>>
>>> +++++
>>>
>>> |0.952380952380952381|null|41.00...|-1.00...|
>>>
>>> |1.380952380952380952|null|50.00...|8.00|
>>>
>>> |1.272727272727272727|null|50.00...|6.00|
>>>
>>> |0.83|null|44.00...|-4.00...|
>>>
>>> |1.00|null|58.00...|   0E-18|
>>>
>>> +++++
>>>
>>>
>>> Just keen to know what I did wrong?
>>>
>>>
>>> Cheers
>>>
>>> P
>>>
>>> --
>>> Philip Dodds
>>>
>>>
>>>
>>
>


-- 
Philip Dodds

philip.do...@gmail.com
@philipdodds
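
In the meantime, a possible workaround, assuming the nulls come from decimal
precision overflow (two unbounded Decimals default to precision/scale (38,18), and
their product cannot be represented): cast to a narrower, explicit precision before
multiplying. The precision and scale below are illustrative only:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

trades.select(
  (col("price").cast(DecimalType(20, 4)) * col("quantity").cast(DecimalType(20, 4)))
    .as("notional")
).show()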


Re: Multiplication on decimals in a dataframe query

2015-12-03 Thread Philip Dodds
Opened https://issues.apache.org/jira/browse/SPARK-12128

Thanks

P

On Thu, Dec 3, 2015 at 8:51 AM, Philip Dodds  wrote:

> I'll open up a JIRA for it,  it appears to work when you use a literal
> number but not when it is coming from the same dataframe
>
> Thanks!
>
> P
>
> On Thu, Dec 3, 2015 at 1:52 AM, Sahil Sareen  wrote:
>
>> +1 looks like a bug
>>
>> I think referencing trades() twice in multiplication is broken,
>>
>> scala> trades.select(trades("quantity")*trades("quantity")).show
>>
>> +-+
>> |(quantity * quantity)|
>> +-+
>> | null|
>> | null|
>>
>> scala> sqlContext.sql("select price*price as PP from trades").show
>>
>> ++
>> |  PP|
>> ++
>> |null|
>> |null|
>>
>>
>> -Sahil
>>
>> On Thu, Dec 3, 2015 at 12:02 PM, Akhil Das 
>> wrote:
>>
>>> Not quiet sure whats happening, but its not an issue with multiplication
>>> i guess as the following query worked for me:
>>>
>>> trades.select(trades("price")*9.5).show
>>> +-+
>>> |(price * 9.5)|
>>> +-+
>>> |199.5|
>>> |228.0|
>>> |190.0|
>>> |199.5|
>>> |190.0|
>>> |256.5|
>>> |218.5|
>>> |275.5|
>>> |218.5|
>>> ..
>>> ..
>>>
>>>
>>> Could it be with the precision? ccing dev list, may be you can open up a
>>> jira for this as it seems to be a bug.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Mon, Nov 30, 2015 at 12:41 AM, Philip Dodds 
>>> wrote:
>>>
 I hit a weird issue when I tried to multiply to decimals in a select
 (either in scala or as SQL), and Im assuming I must be missing the point.

 The issue is fairly easy to recreate with something like the following:


 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.implicits._
 import org.apache.spark.sql.types.Decimal

 case class Trade(quantity: Decimal,price: Decimal)

 val data = Seq.fill(100) {
   val price = Decimal(20+scala.util.Random.nextInt(10))
 val quantity = Decimal(20+scala.util.Random.nextInt(10))

   Trade(quantity, price)
 }

 val trades = sc.parallelize(data).toDF()
 trades.registerTempTable("trades")

 trades.select(trades("price")*trades("quantity")).show

 sqlContext.sql("select
 price/quantity,price*quantity,price+quantity,price-quantity from
 trades").show

 The odd part is if you run it you will see that the addition/division
 and subtraction works but the multiplication returns a null.

 Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11)

 ie.

 +--+

 |(price * quantity)|

 +--+

 |  null|

 |  null|

 |  null|

 |  null|

 |  null|

 +--+


 +++++

 | _c0| _c1| _c2| _c3|

 +++++

 |0.952380952380952381|null|41.00...|-1.00...|

 |1.380952380952380952|null|50.00...|8.00|

 |1.272727272727272727|null|50.00...|6.00|

 |0.83|null|44.00...|-4.00...|

 |1.00|null|58.00...|   0E-18|

 +++++


 Just keen to know what I did wrong?


 Cheers

 P

 --
 Philip Dodds



>>>
>>
>
>
> --
> Philip Dodds
>
> philip.do...@gmail.com
> @philipdodds
>
>


-- 
Philip Dodds

philip.do...@gmail.com
@philipdodds


Why does Spark job stucks and waits for only last tasks to get finished

2015-12-03 Thread unk1102
Hi, I have a Spark application where I keep a queue of 12 Spark jobs to execute in
parallel. Now I see a job is almost completed and only one task is pending, and
because of that last task the job keeps waiting, as I can see in the UI. Please see
the attached snaps. Please help me figure out how to keep Spark jobs from waiting on
their last tasks; the job never moves into the SUCCEEDED state, it is always
running, and it is choking the other jobs so they cannot run. Please guide.

 

 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-Spark-job-stucks-and-waits-for-only-last-tasks-to-get-finished-tp2.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Local mode: Stages hang for minutes

2015-12-03 Thread Richard Marscher
Hi,

I'm doing some testing of workloads using local mode on a server. I see
weird behavior where a job is submitted to the application and it just
hangs for several minutes doing nothing. The stages are submitted as
pending and in the application UI the stage view claims no tasks have been
submitted. Suddenly after a few minutes things suddenly start and run
smoothly.

I'm running against tiny data sets the size of 10s to low 100s of items in
the RDD. I've been attaching with JProfiler, doing thread and heap dumps
but nothing is really standing out as to why Spark seems to periodically
pause for such a long time.

Has anyone else seen similar behavior or aware of some quirk of local mode
that could cause this kind of blocking?

-- 
*Richard Marscher*
Software Engineer
Localytics
Localytics.com  | Our Blog
 | Twitter  |
Facebook  | LinkedIn



Problem with RDD of (Long, Byte[Array])

2015-12-03 Thread Hervé Yviquel
Hi all,

I have a problem when using Array[Byte] in RDD operations.
When I join two different RDDs of type [(Long, Array[Byte])], I obtain wrong
results... But if I translate the byte array into an integer and join two different
RDDs of type [(Long, Integer)], then the results are correct... Any idea?

--
The code:

val byteRDD0 = sc.binaryRecords(path_arg0, 4).zipWithIndex.map{x => (x._2, 
x._1)}
val byteRDD1 = sc.binaryRecords(path_arg1, 4).zipWithIndex.map{x => (x._2, 
x._1)}

byteRDD0.foreach{x => println("BYTE0 " + x._1 + "=> " 
+ByteBuffer.wrap(x._2).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt())}
byteRDD1.foreach{x => println("BYTE1 " + x._1 + "=> " 
+ByteBuffer.wrap(x._2).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt())}

val intRDD0 = byteRDD0.mapValues{x =>
ByteBuffer.wrap(x).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt()}
val intRDD1 = byteRDD1.mapValues{x =>
ByteBuffer.wrap(x).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt()}

val byteJOIN = byteRDD0.join(byteRDD1)
byteJOIN.foreach{x => println("BYTEJOIN " + x._1 + "=> " +
ByteBuffer.wrap(x._2._1).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt() + " - " +
ByteBuffer.wrap(x._2._2).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt())}

val intJOIN = intRDD0.join(intRDD1)
intJOIN.foreach{x => println("INTJOIN " + x._1 + "=> " + x._2._1 + " - " + x._2._2)}


--
stdout:

BYTE0 0=> 1
BYTE0 1=> 3
BYTE0 2=> 5
BYTE0 3=> 7
BYTE0 4=> 9
BYTE0 5=> 11
BYTE0 6=> 13
BYTE0 7=> 15
BYTE0 8=> 17
BYTE0 9=> 19
BYTE0 10=> 21
BYTE0 11=> 23
BYTE0 12=> 25
BYTE0 13=> 27
BYTE0 14=> 29
BYTE0 15=> 31
BYTE0 16=> 33
BYTE0 17=> 35
BYTE0 18=> 37
BYTE0 19=> 39
BYTE0 20=> 41
BYTE0 21=> 43
BYTE0 22=> 45
BYTE0 23=> 47
BYTE0 24=> 49
BYTE0 25=> 51
BYTE0 26=> 53
BYTE0 27=> 55
BYTE0 28=> 57
BYTE0 29=> 59
BYTE1 0=> 0
BYTE1 1=> 1
BYTE1 2=> 2
BYTE1 3=> 3
BYTE1 4=> 4
BYTE1 5=> 5
BYTE1 6=> 6
BYTE1 7=> 7
BYTE1 8=> 8
BYTE1 9=> 9
BYTE1 10=> 10
BYTE1 11=> 11
BYTE1 12=> 12
BYTE1 13=> 13
BYTE1 14=> 14
BYTE1 15=> 15
BYTE1 16=> 16
BYTE1 17=> 17
BYTE1 18=> 18
BYTE1 19=> 19
BYTE1 20=> 20
BYTE1 21=> 21
BYTE1 22=> 22
BYTE1 23=> 23
BYTE1 24=> 24
BYTE1 25=> 25
BYTE1 26=> 26
BYTE1 27=> 27
BYTE1 28=> 28
BYTE1 29=> 29
BYTEJOIN 13=> 1 - 0
BYTEJOIN 19=> 1 - 0
BYTEJOIN 15=> 1 - 0
BYTEJOIN 4=> 1 - 0
BYTEJOIN 21=> 1 - 0
BYTEJOIN 16=> 1 - 0
BYTEJOIN 22=> 1 - 0
BYTEJOIN 25=> 1 - 0
BYTEJOIN 28=> 1 - 0
BYTEJOIN 29=> 1 - 0
BYTEJOIN 11=> 1 - 0
BYTEJOIN 14=> 1 - 0
BYTEJOIN 27=> 1 - 0
BYTEJOIN 0=> 1 - 0
BYTEJOIN 24=> 1 - 0
BYTEJOIN 23=> 1 - 0
BYTEJOIN 1=> 1 - 0
BYTEJOIN 6=> 1 - 0
BYTEJOIN 17=> 1 - 0
BYTEJOIN 3=> 1 - 0
BYTEJOIN 7=> 1 - 0
BYTEJOIN 9=> 1 - 0
BYTEJOIN 8=> 1 - 0
BYTEJOIN 12=> 1 - 0
BYTEJOIN 18=> 1 - 0
BYTEJOIN 20=> 1 - 0
BYTEJOIN 26=> 1 - 0
BYTEJOIN 10=> 1 - 0
BYTEJOIN 5=> 1 - 0
BYTEJOIN 2=> 1 - 0
INTJOIN 13=> 27 - 13
INTJOIN 19=> 39 - 19
INTJOIN 15=> 31 - 15
INTJOIN 4=> 9 - 4
INTJOIN 21=> 43 - 21
INTJOIN 16=> 33 - 16
INTJOIN 22=> 45 - 22
INTJOIN 25=> 51 - 25
INTJOIN 28=> 57 - 28
INTJOIN 29=> 59 - 29
INTJOIN 11=> 23 - 11
INTJOIN 14=> 29 - 14
INTJOIN 27=> 55 - 27
INTJOIN 0=> 1 - 0
INTJOIN 24=> 49 - 24
INTJOIN 23=> 47 - 23
INTJOIN 1=> 3 - 1
INTJOIN 6=> 13 - 6
INTJOIN 17=> 35 - 17
INTJOIN 3=> 7 - 3
INTJOIN 7=> 15 - 7
INTJOIN 9=> 19 - 9
INTJOIN 8=> 17 - 8
INTJOIN 12=> 25 - 12
INTJOIN 18=> 37 - 18
INTJOIN 20=> 41 - 20
INTJOIN 26=> 53 - 26
INTJOIN 10=> 21 - 10
INTJOIN 5=> 11 - 5
INTJOIN 2=> 5 - 2


Thanks,
Hervé
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



AWS CLI --jars comma problem

2015-12-03 Thread Yusuf Can Gürkan
Hello

I have a question about AWS CLI for people who use it.

I create a spark cluster with the aws cli and I'm using a spark step with jar
dependencies. But as you can see below I cannot set multiple jars because the AWS
CLI replaces the comma with a space in ARGS.

Is there a way of doing it? I can accept any kind of solution. For example,
I tried to merge these two jar dependencies but I could not manage it.


aws emr create-cluster 
…..
…..
Args=[--class,com.blabla.job,—jars,"/home/hadoop/first.jar,/home/hadoop/second.jar",/home/hadoop/main.jar,--verbose]
 


I also tried to escape comma with \\, but it did not work.

Re: Python API Documentation Mismatch

2015-12-03 Thread Felix Cheung
Please open an issue in JIRA, thanks!






On Thu, Dec 3, 2015 at 3:03 AM -0800, "Roberto Pagliari" 
 wrote:





Hello,
I believe there is a mismatch between the API documentation (1.5.2) and the 
software currently available.

Not all functions mentioned here
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.recommendation

are, in fact available. For example, the code below from the tutorial works

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)

While the alternative shown in the API documentation will not (it will complain
that ALS takes no arguments; also, by inspecting the module with Python
utilities I could not find several methods mentioned in the API docs).

>>> df = sqlContext.createDataFrame(
... [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 
2, 5.0)],
... ["user", "item", "rating"])
>>> als = ALS(rank=10, maxIter=5)
>>> model = als.fit(df)


Thank you,



Re: Local mode: Stages hang for minutes

2015-12-03 Thread Richard Marscher
I should add that the pauses are not from GC and also in tracing the CPU
call tree in the JVM it seems like nothing is doing any work, just seems to
be idling or blocking.

On Thu, Dec 3, 2015 at 11:24 AM, Richard Marscher 
wrote:

> Hi,
>
> I'm doing some testing of workloads using local mode on a server. I see
> weird behavior where a job is submitted to the application and it just
> hangs for several minutes doing nothing. The stages are submitted as
> pending and in the application UI the stage view claims no tasks have been
> submitted. Suddenly after a few minutes things suddenly start and run
> smoothly.
>
> I'm running against tiny data sets the size of 10s to low 100s of items in
> the RDD. I've been attaching with JProfiler, doing thread and heap dumps
> but nothing is really standing out as to why Spark seems to periodically
> pause for such a long time.
>
> Has anyone else seen similar behavior or aware of some quirk of local mode
> that could cause this kind of blocking?
>
> --
> *Richard Marscher*
> Software Engineer
> Localytics
> Localytics.com  | Our Blog
>  | Twitter  |
> Facebook  | LinkedIn
> 
>



-- 
*Richard Marscher*
Software Engineer
Localytics
Localytics.com  | Our Blog
 | Twitter  |
Facebook  | LinkedIn



Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-03 Thread Cody Koeninger
Do you believe that all exceptions (including catastrophic ones like out of
heap space) should be caught and silently discarded?

Do you believe that a database system that runs out of disk space should
silently continue to accept writes?

What I am trying to say is, when something is broken in a way that can't be
fixed without external intervention, the system shouldn't hide it from
you.  Systems fail, that's a fact of life.  Pretending that a system hasn't
failed when it in fact is broken... not a good plan.



On Wed, Dec 2, 2015 at 11:38 PM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:

> There are other ways to deal with the problem than shutdown the streaming
> job. You can monitor the lag in your consumer to see if the consumer is falling
> behind. If the lag is so high that offsetOutOfRange can happen, you either
> increase the retention period or increase the consumer rate, or do both.
>
> What I am trying to say, streaming job should not fail in any cases ..
>
> Dibyendu
>
> On Thu, Dec 3, 2015 at 9:40 AM, Cody Koeninger  wrote:
>
>> I believe that what differentiates reliable systems is individual
>> components should fail fast when their preconditions aren't met, and other
>> components should be responsible for monitoring them.
>>
>> If a user of the direct stream thinks that your approach of restarting
>> and ignoring data loss is the right thing to do, they can monitor the job
>> (which they should be doing in any case) and restart.
>>
>> If a user of your library thinks that my approach of failing (so they
>> KNOW there was data loss and can adjust their system) is the right thing to
>> do, how do they do that?
>>
>> On Wed, Dec 2, 2015 at 9:49 PM, Dibyendu Bhattacharya <
>> dibyendu.bhattach...@gmail.com> wrote:
>>
>>> Well, even if you do correct retention and increase speed,
>>> OffsetOutOfRange can still come depends on how your downstream processing
>>> is. And if that happen , there is No Other way to recover old messages . So
>>> best bet here from Streaming Job point of view  is to start from earliest
>>> offset rather bring down the streaming job . In many cases goal for a
>>> streaming job is not to shut down and exit in case of any failure. I
>>> believe that is what differentiate a always running streaming job.
>>>
>>> Dibyendu
>>>
>>> On Thu, Dec 3, 2015 at 8:26 AM, Cody Koeninger 
>>> wrote:
>>>
 No, silently restarting from the earliest offset in the case of offset
 out of range exceptions during a streaming job is not the "correct way of
 recovery".

 If you do that, your users will be losing data without knowing why.
 It's more like  a "way of ignoring the problem without actually addressing
 it".

 The only really correct way to deal with that situation is to recognize
 why it's happening, and either increase your Kafka retention or increase
 the speed at which you are consuming.

 On Wed, Dec 2, 2015 at 7:13 PM, Dibyendu Bhattacharya <
 dibyendu.bhattach...@gmail.com> wrote:

> This consumer which I mentioned does not silently throw away data. If
> offset out of range it start for earliest offset and that is correct way 
> of
> recovery from this error.
>
> Dibyendu
> On Dec 2, 2015 9:56 PM, "Cody Koeninger"  wrote:
>
>> Again, just to be clear, silently throwing away data because your
>> system isn't working right is not the same as "recover from any Kafka
>> leader changes and offset out of ranges issue".
>>
>>
>>
>> On Tue, Dec 1, 2015 at 11:27 PM, Dibyendu Bhattacharya <
>> dibyendu.bhattach...@gmail.com> wrote:
>>
>>> Hi, if you use Receiver based consumer which is available in
>>> spark-packages (
>>> http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) ,
>>> this has all built in failure recovery and it can recover from any Kafka
>>> leader changes and offset out of ranges issue.
>>>
>>> Here is the package form github :
>>> https://github.com/dibbhatt/kafka-spark-consumer
>>>
>>>
>>> Dibyendu
>>>
>>> On Wed, Dec 2, 2015 at 5:28 AM, swetha kasireddy <
>>> swethakasire...@gmail.com> wrote:
>>>
 How to avoid those Errors with receiver based approach? Suppose we
 are OK with at least once processing and use receiver based approach 
 which
 uses ZooKeeper but not query Kafka directly, would these 
 errors(Couldn't
 find leader offsets for
 Set([test_stream,5])))be avoided?

 On Tue, Dec 1, 2015 at 3:40 PM, Cody Koeninger 
 wrote:

> KafkaRDD.scala , handleFetchErr
>
> On Tue, Dec 1, 2015 at 3:39 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Cody,
>>
>> How to look at Option 2(see the following)? Which portion of 

Re: How the cores are used in Directstream approach

2015-12-03 Thread Cody Koeninger
There's a 1:1 relationship between Kafka partitions and Spark partitions.
Have you read
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md

A direct stream job will use up to spark.executor.cores number of cores.
If you have fewer partitions than cores, there probably won't be enough
tasks for all cores to be used at once.  If you have more partitions, a
given core will probably execute multiple tasks in a given batch.
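
To make that concrete, here is a minimal sketch; the topic name, broker address
and the assumption of 3 Kafka partitions and 4 available cores are made up for
illustration:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Suppose the topic "events" has 3 partitions and the job runs with 4 cores:
// each batch produces 3 tasks, so one core sits idle during that stage.
val conf = new SparkConf().setAppName("direct-stream-cores")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

stream.foreachRDD { rdd =>
  // 1:1 with the Kafka partitions, regardless of how many cores are available
  println("partitions in this batch: " + rdd.partitions.size)
}

ssc.start()
ssc.awaitTermination()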

On Wed, Dec 2, 2015 at 10:12 PM, Charan Ganga Phani Adabala <
char...@eiqnetworks.com> wrote:

> Hi,
>
> We have *1 kafka topic*, and by using the direct stream approach in spark we
> are processing the data present in the topic, with a one-node cluster,
> to understand how Spark will behave.
>
> My machine configuration is *4 Cores, 16 GB RAM with 1 executor.*
>
> My question is how many cores are used for this job while running.
>
> *In the web console it shows 4 cores are used.*
>
> *How the cores are used in Directstream approach*?
>
> Command to run the Job :
>
> *./spark/bin/spark-submit --master spark://XX.XX.XX.XXX:7077 --class
> org.eiq.IndexingClient ~/spark/lib/IndexingClient.jar*
>
>
>
> Thanks & Regards,
>
> *Ganga Phani Charan Adabala | Software Engineer*
>
> o:  +91-40-23116680 | c:  +91-9491418099
>
> EiQ Networks, Inc. 
>
>
>
>
>
>
> *"This email is intended only for the use of the individual or entity
> named above and may contain information that is confidential and
> privileged. If you are not the intended recipient, you are hereby notified
> that any dissemination, distribution or copying of the email is strictly
> prohibited. If you have received this email in error, please destroy
> the original message."*
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


understanding and disambiguating CPU-core related properties

2015-12-03 Thread Manolis Sifalakis1
I have found the documentation rather poor in helping me understand the
interplay among the following properties in spark, let alone how to set
them. So this post is sent in the hope of some discussion and "enlightenment"
on the topic.

Let me start by asking whether I have understood the following correctly (a small
illustrative sketch of how these are set follows the list):

- spark.driver.cores:   how many cores the driver program should occupy
- spark.cores.max:   how many cores my app will claim for computations
- spark.executor.cores and spark.task.cpus:   how spark.cores.max are 
allocated per JVM (executor) and per task (java thread?)
  I.e. + spark.executor.cores:   each JVM instance (executor) should use 
that many cores
+ spark.task.cpus: each task should occupy at most this # of cores
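
Here is roughly how I picture these being set; the numbers are purely
illustrative, not a recommendation:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only: the app asks for at most 12 cores in total,
// each executor JVM gets 4 of them, and every task claims 2 cores of its
// executor, so at most 2 tasks run concurrently per executor.
val conf = new SparkConf()
  .setAppName("core-settings-example")
  .set("spark.cores.max", "12")
  .set("spark.executor.cores", "4")
  .set("spark.task.cpus", "2")
  .set("spark.driver.cores", "2")   // relevant when the driver runs inside the cluster

val sc = new SparkContext(conf)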

If so far good, then...

q1: Is spark.cores.max inclusive or not of spark.driver.cores ?

q2: How should one decide statically, a priori, how to distribute the
spark.cores.max to JVMs and tasks?

q3: Since the set-up of cores-per-worker restricts how many cores can be
available per executor at most, and since an executor cannot span across workers,
what is the rationale behind an application claiming cores
(spark.cores.max) as opposed to merely executors? (This would make an app
never fail to be admitted.)

TIA for any clarifications/intuitions/experiences on the topic

best

Manolis.




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Multiplication on decimals in a dataframe query

2015-12-03 Thread Philip Dodds
Did a little more digging and it appears it was just the way I constructed
the Decimal.

It works if you do:

val data = Seq.fill(5) {
  Trade(Decimal(BigDecimal(5), 38, 20), Decimal(BigDecimal(5), 38, 20))
}


On Thu, Dec 3, 2015 at 8:58 AM, Philip Dodds  wrote:

> Opened https://issues.apache.org/jira/browse/SPARK-12128
>
> Thanks
>
> P
>
> On Thu, Dec 3, 2015 at 8:51 AM, Philip Dodds 
> wrote:
>
>> I'll open up a JIRA for it,  it appears to work when you use a literal
>> number but not when it is coming from the same dataframe
>>
>> Thanks!
>>
>> P
>>
>> On Thu, Dec 3, 2015 at 1:52 AM, Sahil Sareen  wrote:
>>
>>> +1 looks like a bug
>>>
>>> I think referencing trades() twice in multiplication is broken,
>>>
>>> scala> trades.select(trades("quantity")*trades("quantity")).show
>>>
>>> +-+
>>> |(quantity * quantity)|
>>> +-+
>>> | null|
>>> | null|
>>>
>>> scala> sqlContext.sql("select price*price as PP from trades").show
>>>
>>> ++
>>> |  PP|
>>> ++
>>> |null|
>>> |null|
>>>
>>>
>>> -Sahil
>>>
>>> On Thu, Dec 3, 2015 at 12:02 PM, Akhil Das 
>>> wrote:
>>>
 Not quite sure what's happening, but it's not an issue with
 multiplication, I guess, as the following query worked for me:

 trades.select(trades("price")*9.5).show
 +-+
 |(price * 9.5)|
 +-+
 |199.5|
 |228.0|
 |190.0|
 |199.5|
 |190.0|
 |256.5|
 |218.5|
 |275.5|
 |218.5|
 ..
 ..


 Could it be with the precision? CCing the dev list; maybe you can open up
 a JIRA for this as it seems to be a bug.

 Thanks
 Best Regards

 On Mon, Nov 30, 2015 at 12:41 AM, Philip Dodds 
 wrote:

> I hit a weird issue when I tried to multiply two decimals in a select
> (either in scala or as SQL), and I'm assuming I must be missing the point.
>
> The issue is fairly easy to recreate with something like the following:
>
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> import org.apache.spark.sql.types.Decimal
>
> case class Trade(quantity: Decimal,price: Decimal)
>
> val data = Seq.fill(100) {
>   val price = Decimal(20+scala.util.Random.nextInt(10))
> val quantity = Decimal(20+scala.util.Random.nextInt(10))
>
>   Trade(quantity, price)
> }
>
> val trades = sc.parallelize(data).toDF()
> trades.registerTempTable("trades")
>
> trades.select(trades("price")*trades("quantity")).show
>
> sqlContext.sql("select
> price/quantity,price*quantity,price+quantity,price-quantity from
> trades").show
>
> The odd part is if you run it you will see that the addition/division
> and subtraction works but the multiplication returns a null.
>
> Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11)
>
> ie.
>
> +--+
>
> |(price * quantity)|
>
> +--+
>
> |  null|
>
> |  null|
>
> |  null|
>
> |  null|
>
> |  null|
>
> +--+
>
>
> +++++
>
> | _c0| _c1| _c2| _c3|
>
> +++++
>
> |0.952380952380952381|null|41.00...|-1.00...|
>
> |1.380952380952380952|null|50.00...|8.00|
>
> |1.272727272727272727|null|50.00...|6.00|
>
> |0.83|null|44.00...|-4.00...|
>
> |1.00|null|58.00...|   0E-18|
>
> +++++
>
>
> Just keen to know what I did wrong?
>
>
> Cheers
>
> P
>
> --
> Philip Dodds
>
>
>

>>>
>>
>>
>> --
>> Philip Dodds
>>
>> philip.do...@gmail.com
>> @philipdodds
>>
>>
>
>
> --
> Philip Dodds
>
> philip.do...@gmail.com
> @philipdodds
>
>


-- 
Philip Dodds

philip.do...@gmail.com
@philipdodds


Spark Streaming BackPressure and Custom Receivers

2015-12-03 Thread Deenar Toraskar
Hi

I was going through the Spark Streaming BackPressure feature documentation and
wanted to understand how I can ensure my custom receiver is able to handle
rate limiting. I have a custom receiver similar to the TwitterInputDStream,
but there is no obvious way to throttle what is being read from the source,
in response to rate limit events.

class MyReceiver(storageLevel: StorageLevel) extends NetworkReceiver[String](storageLevel) {

  def onStart() {
    // Setup stuff (start threads, open sockets, etc.) to start receiving data.
    // Must start new thread to receive data, as onStart() must be non-blocking.

    // Call store(...) in those threads to store received data into Spark's memory.

    // Call stop(...), restart(...) or reportError(...) on any thread based on how
    // different errors need to be handled.

    // See corresponding method documentation for more details
  }

  def onStop() {
    // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
  }
}


http://spark.apache.org/docs/latest/streaming-custom-receivers.html
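
For reference, here is a bare-bones sketch of the kind of receiver I have in mind
(fetchNext() is a placeholder for the real source). My working assumption, which I
would like to confirm, is that single-item store() calls go through the receiver's
block generator, and that is the piece the back-pressure rate limit actually
throttles:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyThrottledReceiver(storageLevel: StorageLevel)
    extends Receiver[String](storageLevel) {

  def onStart(): Unit = {
    new Thread("my-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          val record = fetchNext() // assumption: a blocking call on the real source
          store(record)            // single-item store(); blocks when the current rate limit is hit
        }
      }
    }.start()
  }

  def onStop(): Unit = {} // the receiving thread exits via isStopped()

  private def fetchNext(): String = ??? // placeholder for the real source
}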

The back pressure design documentation states the following. I am unable to
figure out how this works for the TwitterInputDStream either. My receiver
is similar to the TwitterInputDStream one.

   - TwitterInputDStream - no changes required. The receiver-based mechanism
     will handle rate limiting


References https://issues.apache.org/jira/browse/SPARK-7398 and
https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit#

Regards
Deenar


Re: Does Spark streaming support iterative operator?

2015-12-03 Thread Sean Owen
Yes, in the sense that you can create, and trigger an action on, as many
RDDs derived from the batch's RDD as you like.
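
As a rough sketch (parse, refineOnce, needsMoreWork and maxIterations below are
placeholders for your own code, not Spark API), the "iteration" is just ordinary
driver-side code run per batch inside foreachRDD:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

// Placeholders standing in for application code:
val maxIterations = 5
def parse(line: String): String = line
def refineOnce(x: String): String = x
def needsMoreWork(x: String): Boolean = false

val ssc = new StreamingContext(sc, Seconds(10))      // assumes an existing SparkContext `sc`
val stream = ssc.socketTextStream("localhost", 9999) // any input DStream works here

stream.foreachRDD { (batch: RDD[String], time: Time) =>
  // Derive as many RDDs from the batch as needed and trigger an action on each.
  var current = batch.map(parse)
  var i = 0
  while (i < maxIterations && current.filter(needsMoreWork).count() > 0) {
    current = current.map(refineOnce)
    i += 1
  }
  current.saveAsTextFile("out/batch-" + time.milliseconds)
}

ssc.start()
ssc.awaitTermination()

Each pass over `current` that ends in an action launches another Spark job on that
batch, which is all the "feedback" there is; there is no built-in cyclic stream
like Storm's or Flink's.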

On Thu, Dec 3, 2015 at 8:04 PM, Wang Yangjun  wrote:
> Hi,
>
> In storm we could do thing like:
>
> TopologyBuilder builder = new TopologyBuilder();
>
> builder.setSpout("spout", new NumberSpout());
> builder.setBolt("mybolt", new Mybolt())
> .shuffleGrouping("spout")
> .shuffleGrouping("mybolt", "iterativeStream");
>
> It means that after one operation there are two output streams. One of them
> will be a input of the operation. Seemly, Flink streaming also supports this
> feature -
> https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html#iterations.
>
> My question is does Spark streaming support this feature also? If yes, how?
> I couldn’t find it from the Internet.
>
> Thanks for your help.
> Jun
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Does Spark streaming support iterative operator?

2015-12-03 Thread Wang Yangjun
Hi,

In storm we could do thing like:

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("spout", new NumberSpout());
builder.setBolt("mybolt", new Mybolt())
.shuffleGrouping("spout")
.shuffleGrouping("mybolt", "iterativeStream");

It means that after one operation there are two output streams. One of them
will be an input of the operation. Seemingly, Flink streaming also supports this
feature -
https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html#iterations.

My question is does Spark streaming support this feature also? If yes, how? I 
couldn’t find it from the Internet.

Thanks for your help.
Jun



Re: Spark Streaming from S3

2015-12-03 Thread Michele Freschi
Hi Steve,

I'm on hadoop 2.7.1 using the s3n

From:  Steve Loughran 
Date:  Thursday, December 3, 2015 at 4:12 AM
Cc:  SPARK-USERS 
Subject:  Re: Spark Streaming from S3


> On 3 Dec 2015, at 00:42, Michele Freschi  wrote:
> 
> Hi all,
> 
> I have an app streaming from s3 (textFileStream) and recently I've observed
> increasing delay and long time to list files:
> 
> INFO dstream.FileInputDStream: Finding new files took 394160 ms
> ...
> INFO scheduler.JobScheduler: Total delay: 404.796 s for time 144910020 ms
> (execution: 10.154 s)
> 
> At this time I have about 13K files under the key prefix that I'm monitoring -
> hadoop takes about 6 minutes to list all the files while aws cli takes only
> seconds. 
> My understanding is that this is a current limitation of hadoop but I wanted
> to confirm it in case it's a misconfiguration on my part.

not a known issue.

Usual questions: which Hadoop version and are you using s3n or s3a
connectors. The latter does use the AWS sdk, but it's only been stable
enough to use in Hadoop 2.7

> 
> Some alternatives I'm considering:
> 1. copy old files to a different key prefix
> 2. use one of the available SQS receivers
> (https://github.com/imapi/spark-sqs-receiver ?)
> 3. implement the s3 listing outside of spark and use socketTextStream, but I
> couldn't find if it's reliable or not
> 4. create a custom s3 receiver using aws sdk (even if doesn't look like it's
> possible to use them from pyspark)
> 
> Has anyone experienced the same issue and found a better way to solve it ?
> 
> Thanks,
> Michele
> 







RE: Any clue on this error, Exception in thread "main" java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

2015-12-03 Thread Mich Talebzadeh
Hi Marcelo.

So this is the approach I am going to take:

Use spark 1.3 pre-built
Use Hive 1.2.1. Do not copy over anything to add to hive libraries from spark 
1.3 libraries
Use Hadoop 2.6

There is no need to mess around with the libraries. I will try to unset my 
CLASSPATH and reset again and try again


Thanks,


Mich Talebzadeh

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

-Original Message-
From: Marcelo Vanzin [mailto:van...@cloudera.com] 
Sent: 03 December 2015 18:45
To: Mich Talebzadeh 
Cc: u...@hive.apache.org; user 
Subject: Re: Any clue on this error, Exception in thread "main" 
java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

On Thu, Dec 3, 2015 at 10:32 AM, Mich Talebzadeh  wrote:

> hduser@rhes564::/usr/lib/spark/logs> hive --version
> SLF4J: Found binding in
> [jar:file:/usr/lib/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar!/org
> /slf4j/impl/StaticLoggerBinder.class]

As I suggested before, you have Spark's assembly in the Hive classpath. That's 
not the way to configure hive-on-spark; if the documentation you're following 
tells you to do that, it's wrong.

(And sorry Ted, but please ignore Ted's suggestion. Hive-on-Spark should work 
fine with Spark 1.3 if it's configured correctly. You really don't want to be 
overriding Hive classes with the ones shipped in the Spark assembly, regardless 
of the version of Spark being used.)

--
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Streaming Running Out Of Memory in 1.5.0.

2015-12-03 Thread Augustus Hong
Hi All,

I'm running Spark Streaming (Python) with Direct Kafka and I'm seeing that
the memory usage will slowly go up and eventually kill the job in a few
days.

Everything runs fine at first but after a few days the job started issuing
*error: [Errno 104] Connection reset by peer ,  *followed by
*java.lang.OutOfMemoryError: GC overhead limit exceeded *when I tried to
access the web UI.

I'm not using any fancy settings, pretty much just the default, and give
each executor(4 cores) 14G of memory and 40G to the driver.

I looked through the mailing list and around the web. There were a few
streaming running-out-of-memory issues but with no apparent solutions. If
anyone has insights into this please let me know!

Best,
Augustus

-

Detailed Logs below:

The first error I see is this:

15/12/02 19:05:03 INFO scheduler.DAGScheduler: Job 270661 finished: call at
>> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py:1206, took
>> 0.560671 s
>
> 15/12/02 19:05:09 ERROR python.PythonRDD: Error while sending iterator
>
> java.net.SocketTimeoutException: Accept timed out
>
>   at java.net.PlainSocketImpl.socketAccept(Native Method)
>
>   at
>> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
>
>   at java.net.ServerSocket.implAccept(ServerSocket.java:530)
>
>   at java.net.ServerSocket.accept(ServerSocket.java:498)
>
>   at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:613)
>
> Traceback (most recent call last):
>
>   File "/root/spark/python/lib/pyspark.zip/pyspark/streaming/util.py",
>> line 62, in call
>
> r = self.func(t, *rdds)
>
>   File "/root/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py",
>> line 159, in 
>
> func = lambda t, rdd: old_func(rdd)
>
>   File "/root/spark-projects/click-flow/click-stream.py", line 141, in
>> 
>
> keys.foreachRDD(lambda rdd: rdd.foreachPartition(lambda part:
>> save_sets(part, KEY_SET_NAME, ITEMS_PER_KEY_LIMIT_KEYS)))
>
>   File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 766, in
>> foreachPartition
>
> self.mapPartitions(func).count()  # Force evaluation
>
>   File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1006, in
>> count
>
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>
>   File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 997, in
>> sum
>
> return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
>
>   File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 871, in
>> fold
>
> vals = self.mapPartitions(func).collect()
>
>   File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 774, in
>> collect
>
> return list(_load_from_socket(port, self._jrdd_deserializer))
>
>   File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 142, in
>> _load_from_socket
>
> for item in serializer.load_stream(rf):
>
>   File "/root/spark/python/lib/pyspark.zip/pyspark/serializers.py", line
>> 139, in load_stream
>
> yield self._read_with_length(stream)
>
>   File "/root/spark/python/lib/pyspark.zip/pyspark/serializers.py", line
>> 156, in _read_with_length
>
> length = read_int(stream)
>
>   File "/root/spark/python/lib/pyspark.zip/pyspark/serializers.py", line
>> 542, in read_int
>
> length = stream.read(4)
>
>   File "/usr/lib64/python2.6/socket.py", line 383, in read
>
> data = self._sock.recv(left)
>
> error: [Errno 104] Connection reset by peer
>
> 15/12/02 19:05:09 INFO scheduler.JobScheduler: Finished job streaming job
>> 144908232 ms.1 from job set of time 144908232 ms
>
>
>>
>>
And then when I try to access the web UI this error is thrown, leading me
to believe that it has something to do with memory being full:

15/12/02 19:43:34 WARN servlet.ServletHandler: Error for /jobs/
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at java.lang.AbstractStringBuilder.(AbstractStringBuilder.java:64)
>   at java.lang.StringBuilder.(StringBuilder.java:97)
>   at scala.collection.mutable.StringBuilder.(StringBuilder.scala:46)
>   at scala.collection.mutable.StringBuilder.(StringBuilder.scala:51)
>   at scala.xml.Attribute$class.toString1(Attribute.scala:96)
>   at scala.xml.UnprefixedAttribute.toString1(UnprefixedAttribute.scala:16)
>   at scala.xml.MetaData.buildString(MetaData.scala:202)
>   at scala.xml.Utility$.serialize(Utility.scala:216)
>   at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
>   at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
>   at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.xml.Utility$.sequenceToXML(Utility.scala:256)
>   at scala.xml.Utility$.serialize(Utility.scala:227)
>   at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
>   at 

Re: Does Spark streaming support iterative operator?

2015-12-03 Thread Wang Yangjun
Hi,

Thanks for your quick reply. Could you provide some pseudocode? It is a little 
hard to understand.

Thanks
Jun




On 03/12/15 22:16, "Sean Owen"  wrote:

>Yes, in the sense that you can create and trigger an action on as many
>RDDs created from the batch's RDD that you like.
>
>On Thu, Dec 3, 2015 at 8:04 PM, Wang Yangjun  wrote:
>> Hi,
>>
>> In storm we could do thing like:
>>
>> TopologyBuilder builder = new TopologyBuilder();
>>
>> builder.setSpout("spout", new NumberSpout());
>> builder.setBolt(“mybolt", new Mybolt())
>> .shuffleGrouping("spout")
>> .shuffleGrouping(“mybolt", “iterativeStream");
>>
>> It means that after one operation there are two output streams. One of them
>> will be a input of the operation. Seemly, Flink streaming also supports this
>> feature -
>> https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html#iterations.
>>
>> My question is does Spark streaming support this feature also? If yes, how?
>> I couldn’t find it from the Internet.
>>
>> Thanks for your help.
>> Jun
>>
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Kafka - streaming from multiple topics

2015-12-03 Thread Cody Koeninger
Yeah, that general plan should work, but might be a little awkward for
adding topicPartitions after the fact (i.e. when you have stored offsets
for some, but not all, of your topicpartitions)

Personally I just query kafka for the starting offsets if they dont exist
in the DB, using the methods in KafkaCluster.scala.

Yes, if you don't include a starting offset for a particular partition, it
will be ignored.
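
Roughly, the if/else approach you describe would look something like this; the
DB lookup, broker address and topic names below are placeholders:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(30)) // assumes an existing SparkContext `sc`

// Placeholder for your own storage lookup; empty the very first time the job runs.
def loadOffsetsFromDb(topics: Set[String]): Map[TopicAndPartition, Long] = Map.empty

val topics = Set("topicA", "topicB")
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",
  "auto.offset.reset" -> "largest") // only consulted when no explicit offsets are given

val fromOffsets = loadOffsetsFromDb(topics)

val stream =
  if (fromOffsets.isEmpty) {
    // First run: let auto.offset.reset pick the starting point
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
  } else {
    // Later runs: resume exactly where the stored offsets left off
    val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)
  }

If offsets exist for only some of the partitions, you would still need to fill in
the missing ones (e.g. via the KafkaCluster methods) before taking the second
branch.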

On Thu, Dec 3, 2015 at 3:31 PM, Dan Dutrow  wrote:

> Hey Cody, I'm convinced that I'm not going to get the functionality I want
> without using the Direct Stream API.
>
> I'm now looking through
> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md#exactly-once-using-transactional-writes
> where you say "For the very first time the job is run, the table can be
> pre-loaded with appropriate starting offsets."
>
> Could you provide some guidance on how to determine valid starting offsets
> the very first time, particularly in my case where I have 10+ topics in
> multiple different deployment environments with an unknown and potentially
> dynamic number of partitions per topic per environment?
>
> I'd be happy if I could initialize all consumers to the value of 
> *auto.offset.reset
> = "largest"*, record the partitions and offsets as they flow through
> spark, and then use those discovered offsets from thereon out.
>
> I'm thinking I can probably just do some if/else logic and use the basic
> createDirectStream and the more advanced
> createDirectStream(...fromOffsets...) if the offsets for my topic name
> exists in the database. Any reason that wouldn't work? If I don't include
> an offset range for a particular partition, will that partition be ignored?
>
>
>
>
> On Wed, Dec 2, 2015 at 3:17 PM Cody Koeninger  wrote:
>
>> Use the direct stream.  You can put multiple topics in a single stream,
>> and differentiate them on a per-partition basis using the offset range.
>>
>> On Wed, Dec 2, 2015 at 2:13 PM, dutrow  wrote:
>>
>>> I found the JIRA ticket:
>>> https://issues.apache.org/jira/browse/SPARK-2388
>>>
>>> It was marked as invalid.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-streaming-from-multiple-topics-tp8678p25550.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>> --
> Dan ✆
>


SparkR in Spark 1.5.2 jsonFile Bug Found

2015-12-03 Thread tomasr3
Hello,

I believe to have encountered a bug with Spark 1.5.2. I am using RStudio and
SparkR to read in JSON files with jsonFile(sqlContext, "path"). If "path" is
a single path (e.g., "/path/to/dir0"), then it works fine;

but, when "path" is a vector of paths (e.g.

path <- c("/path/to/dir1","/path/to/dir2"), then I get the following error
message:

> raw.terror<-jsonFile(sqlContext,path)
15/12/03 15:59:55 ERROR RBackendHandler: jsonFile on 1 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
  java.io.IOException: No input paths specified in job
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2

Note that passing a vector of paths in Spark-1.4.1 works just fine. Any help
is greatly appreciated in case this is not a bug but rather an environment or
other issue.

Best,
T



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-in-Spark-1-5-2-jsonFile-Bug-Found-tp25560.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark SQL - Reading HCatalog Table

2015-12-03 Thread Sandip Mehta
Hi All,

I have a table created in Hive and stored/read using HCatalog. The table is in ORC
format. I want to read this table in Spark SQL and do a join with RDDs. How
can I connect to HCatalog and get the data from Spark SQL?
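
What I have in mind is something along these lines (table, column and RDD contents
are made up); does a plain HiveContext against the shared metastore do the right
thing here, given that HCatalog fronts the same metastore?

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc) // assumes hive-site.xml is on the classpath
import hiveContext.implicits._

// The ORC-backed table registered in the metastore (name made up)
val orcDF = hiveContext.sql("SELECT key, value FROM mydb.orc_table")

// Turn an existing RDD into a DataFrame and join
case class Lookup(key: String, extra: String)
val lookupDF = sc.parallelize(Seq(Lookup("a", "x"), Lookup("b", "y"))).toDF()

val joined = orcDF.join(lookupDF, "key")
joined.show()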

SM



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



consumergroup not working

2015-12-03 Thread Hudong Wang
Hi, I am trying to read data from kafka in zookeeper mode with following code.

val kafkaParams = Map[String, String] (
  "zookeeper.connect" -> zookeeper,
  "metadata.broker.list" -> brokers,
  "group.id" -> consumerGroup,
  "auto.offset.reset" -> autoOffsetReset)

return KafkaUtils.createStream[String, String,
  kafka.serializer.StringDecoder, kafka.serializer.StringDecoder](ssc,
  kafkaParams,
  Map(topic -> 1),
  StorageLevel.MEMORY_AND_DISK_SER_2)

I specified the group.id but in zookeeper I don't see my consumer group, though
I see other people's consumer groups.

Did I do something wrong in my code? Is there any way I can troubleshoot?

Thanks,
Tony

newbie best practices: is spark-ec2 intended to be used to manage long-lasting infrastructure ?

2015-12-03 Thread Andy Davidson
About 2 months ago I used spark-ec2 to set up a small cluster. The cluster
runs a spark streaming app 7x24 and stores the data to hdfs. I also need to
run some batch analytics on the data.

Now that I have a little more experience I wonder if this was a good way to
set up the cluster the following issues
1. I have not been able to find explicit directions for upgrading the spark
version (see
http://search-hadoop.com/m/q3RTt7E0f92v0tKh2=Re+Upgrading+Spark+in+EC2+clusters)
2. I am not sure where the data is physically stored. I think I may
accidentally lose all my data
3. spark-ec2 makes it easy to launch a cluster with as many machines as you
like; however it's not clear how I would add slaves to an existing
installation

In our Java streaming app we call rdd.saveAsTextFile("hdfs://path");

ephemeral-hdfs/conf/hdfs-site.xml:

  

dfs.data.dir

/mnt/ephemeral-hdfs/data,/mnt2/ephemeral-hdfs/data

  



persistent-hdfs/conf/hdfs-site.xml



$ mount

/dev/xvdb on /mnt type ext3 (rw,nodiratime)

/dev/xvdf on /mnt2 type ext3 (rw,nodiratime)



http://spark.apache.org/docs/latest/ec2-scripts.html


"The spark-ec2 script also supports pausing a cluster. In this case, the VMs
are stopped but not terminated, so they lose all data on ephemeral disks but
keep the data in their root partitions and their persistent-pdfs.²


Initially I thought using HDFS was a good idea. spark-ec2 makes HDFS easy to
use. I incorrectly thought spark somehow knew how HDFS partitioned my data.

I think many people are using amazon s3. I do not have any direct experience
with S3. My concern would be that the data is not physically stored close
to my slaves, i.e. high communication costs.

Any suggestions would be greatly appreciated

Andy




Re: How and where to update release notes for spark rel 1.6?

2015-12-03 Thread Andy Davidson
Hi JB

Do you know where I can find instructions for upgrading an existing
installation? I searched the link you provided for "update" and "upgrade"

Kind regards

Andy

From:  Jean-Baptiste Onofré 
Date:  Thursday, December 3, 2015 at 5:29 AM
To:  "user @spark" 
Subject:  Re: How and where to update release notes for spark rel 1.6?

> Hi Ravi,
> 
> Even if it's not perfect, you can take a look on the current
> ReleaseNotes on Jira:
> 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12333083
> 
> Regards
> JB
> 
> On 12/03/2015 12:01 PM, RaviShankar KS wrote:
>>  Hi,
>> 
>>  How and where to update release notes for spark rel 1.6?
>>  pls help.
>> 
>>  There are a few methods with changed params, and a few deprecated ones
>>  that need to be documented.
>> 
>>  Thanks
>>  Ravi
> 
> -- 
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 




Re: Local mode: Stages hang for minutes

2015-12-03 Thread Richard Marscher
Ended up realizing I was only looking at the call tree for running threads.
After looking at blocking threads I saw that it was spending hundreds of
compute hours blocking on jets3t calls to S3. Realized it was looking over
likely thousands if not hundreds of thousands of S3 files accumulated over
many rounds of load testing. Cleaning the files fixed the issue and I'm
pretty sure it's already well known that the underlying s3n doesn't handle
traversing a large s3 file tree with the sparkContext.textFile function
using wildcard well.

On Thu, Dec 3, 2015 at 12:57 PM, Ali Tajeldin EDU 
wrote:

> You can try to run "jstack" a couple of times while the app is hung to
> look for patterns  for where the app is hung.
> --
> Ali
>
>
> On Dec 3, 2015, at 8:27 AM, Richard Marscher 
> wrote:
>
> I should add that the pauses are not from GC and also in tracing the CPU
> call tree in the JVM it seems like nothing is doing any work, just seems to
> be idling or blocking.
>
> On Thu, Dec 3, 2015 at 11:24 AM, Richard Marscher <
> rmarsc...@localytics.com> wrote:
>
>> Hi,
>>
>> I'm doing some testing of workloads using local mode on a server. I see
>> weird behavior where a job is submitted to the application and it just
>> hangs for several minutes doing nothing. The stages are submitted as
>> pending and in the application UI the stage view claims no tasks have been
>> submitted. Suddenly after a few minutes things suddenly start and run
>> smoothly.
>>
>> I'm running against tiny data sets the size of 10s to low 100s of items
>> in the RDD. I've been attaching with JProfiler, doing thread and heap dumps
>> but nothing is really standing out as to why Spark seems to periodically
>> pause for such a long time.
>>
>> Has anyone else seen similar behavior or aware of some quirk of local
>> mode that could cause this kind of blocking?
>>
>> --
>> *Richard Marscher*
>> Software Engineer
>> Localytics
>> Localytics.com  | Our Blog
>>  | Twitter  |
>> Facebook  | LinkedIn
>> 
>>
>
>
>
> --
> *Richard Marscher*
> Software Engineer
> Localytics
> Localytics.com  | Our Blog
>  | Twitter  |
> Facebook  | LinkedIn
> 
>
>
>


-- 
*Richard Marscher*
Software Engineer
Localytics
Localytics.com  | Our Blog
 | Twitter  |
Facebook  | LinkedIn



Spark java.lang.SecurityException: class “javax.servlet.FilterRegistration”' with sbt

2015-12-03 Thread Moises Baly
Hi all,

I'm having issues with javax.servlet when running simple Spark jobs. I'm
using Scala + sbt and found a solution for this error; the problem is, this
particular solution is not working when running tests. Any idea how I can
exclude all conflicting dependencies for all scopes? Here is my partial
build.sbt.

scalaVersion := "2.11.7"


//Library repositories
resolvers ++= Seq(
  Resolver.mavenLocal,
  "Scala-Tools Maven2 Repository" at "http://scala-tools.org/repo-releases;,
  "Java.net repository" at "http://download.java.net/maven/2;,
  "GeoTools" at "http://download.osgeo.org/webdav/geotools;,
  "Apache" at 
"https://repository.apache.org/service/local/repositories/releases/content;,
  "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/;,
  "OpenGeo Maven Repository" at "http://repo.opengeo.org;,
  "Typesafe" at "https://repo.typesafe.com/typesafe/releases/;,
  "Spray Repository" at "http://repo.spray.io;
)

//Library versions
val geotools_version = "13.2"
val accumulo_version = "1.6.0-cdh5.1.4"
val hadoop_version = "2.6.0-cdh5.4.7"
val hadoop_client_version = "2.6.0-mr1-cdh5.4.7"
val geowave_version = "0.9.0-SNAPSHOT"
val akka_version = "2.4.0"
val spray_version = "1.3.3"
val spark_version = "1.5.2"

/**  DEV  **/
//Library Dependencies for dev
libraryDependencies ++= Seq(
  //Scala
  "org.scala-lang" % "scala-library-all" % scalaVersion.value,

  //GeoTools
  "org.geotools" % "gt-data" % geotools_version,
  "org.geotools" % "gt-geojson" % geotools_version,

  //Apache
  "org.apache.accumulo" % "accumulo-core" % accumulo_version,

  //Geowave
  "mil.nga.giat" % "geowave-core-store" % geowave_version,
  "mil.nga.giat" % "geowave-datastore-accumulo" % geowave_version,
  "mil.nga.giat" % "geowave-adapter-vector" % geowave_version,

  //Other
  "com.typesafe" % "config" % "1.3.0",

  //Spray - Akka
  "com.typesafe.akka" %% "akka-actor" % akka_version,

  "io.spray" %% "spray-can" % spray_version,
  "io.spray" %% "spray-routing" % spray_version,
  "com.typesafe.play" %% "play-json" % "2.5.0-M1"
exclude("com.fasterxml.jackson.core", "jackson-databind"),

  //Spark
  "org.apache.spark" %% "spark-core" % spark_version,

  //Testing
  "org.scalatest" % "scalatest_2.11" % "2.2.4" % "test"
).map(
  _.excludeAll(
ExclusionRule(organization = "javax.servlet")
  )
)

test in assembly := {}

Thank you for your help,

Moises


Re: Kafka - streaming from multiple topics

2015-12-03 Thread Dan Dutrow
Hey Cody, I'm convinced that I'm not going to get the functionality I want
without using the Direct Stream API.

I'm now looking through
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md#exactly-once-using-transactional-writes
where you say "For the very first time the job is run, the table can be
pre-loaded with appropriate starting offsets."

Could you provide some guidance on how to determine valid starting offsets
the very first time, particularly in my case where I have 10+ topics in
multiple different deployment environments with an unknown and potentially
dynamic number of partitions per topic per environment?

I'd be happy if I could initialize all consumers to the value of
*auto.offset.reset
= "largest"*, record the partitions and offsets as they flow through spark,
and then use those discovered offsets from thereon out.

I'm thinking I can probably just do some if/else logic and use the basic
createDirectStream and the more advanced
createDirectStream(...fromOffsets...) if the offsets for my topic name
exists in the database. Any reason that wouldn't work? If I don't include
an offset range for a particular partition, will that partition be ignored?




On Wed, Dec 2, 2015 at 3:17 PM Cody Koeninger  wrote:

> Use the direct stream.  You can put multiple topics in a single stream,
> and differentiate them on a per-partition basis using the offset range.
>
> On Wed, Dec 2, 2015 at 2:13 PM, dutrow  wrote:
>
>> I found the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2388
>>
>> It was marked as invalid.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-streaming-from-multiple-topics-tp8678p25550.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>> --
Dan ✆


Re: Spark Streaming Running Out Of Memory in 1.5.0.

2015-12-03 Thread Ted Yu
bq. lambda part: save_sets(part, KEY_SET_NAME,

Where do you save the part to ?

For OutOfMemoryError, the last line was from Utility.scala
Anything before that ?

Thanks

On Thu, Dec 3, 2015 at 11:47 AM, Augustus Hong 
wrote:

> Hi All,
>
> I'm running Spark Streaming (Python) with Direct Kafka and I'm seeing that
> the memory usage will slowly go up and eventually kill the job in a few
> days.
>
> Everything runs fine at first but after a few days the job started issuing
> *error: [Errno 104] Connection reset by peer ,  *followed by
> *java.lang.OutOfMemoryError: GC overhead limit exceeded *when I tried to
> access the web UI.
>
> I'm not using any fancy settings, pretty much just the default, and give
> each executor(4 cores) 14G of memory and 40G to the driver.
>
> I looked through the mailing list and around the web. There were a few
> streaming running out of memory issues but with no apparent solutions. If
> anyone have insights into this please let me know!
>
> Best,
> Augustus
>
>
> -
>
> Detailed Logs below:
>
> The first error I see is this:
>
> 15/12/02 19:05:03 INFO scheduler.DAGScheduler: Job 270661 finished: call
>>> at /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py:1206,
>>> took 0.560671 s
>>
>> 15/12/02 19:05:09 ERROR python.PythonRDD: Error while sending iterator
>>
>> java.net.SocketTimeoutException: Accept timed out
>>
>>   at java.net.PlainSocketImpl.socketAccept(Native Method)
>>
>>   at
>>> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
>>
>>   at java.net.ServerSocket.implAccept(ServerSocket.java:530)
>>
>>   at java.net.ServerSocket.accept(ServerSocket.java:498)
>>
>>   at
>>> org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:613)
>>
>> Traceback (most recent call last):
>>
>>   File "/root/spark/python/lib/pyspark.zip/pyspark/streaming/util.py",
>>> line 62, in call
>>
>> r = self.func(t, *rdds)
>>
>>   File "/root/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py",
>>> line 159, in 
>>
>> func = lambda t, rdd: old_func(rdd)
>>
>>   File "/root/spark-projects/click-flow/click-stream.py", line 141, in
>>> 
>>
>> keys.foreachRDD(lambda rdd: rdd.foreachPartition(lambda part:
>>> save_sets(part, KEY_SET_NAME, ITEMS_PER_KEY_LIMIT_KEYS)))
>>
>>   File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 766, in
>>> foreachPartition
>>
>> self.mapPartitions(func).count()  # Force evaluation
>>
>>   File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1006, in
>>> count
>>
>> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>>
>>   File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 997, in
>>> sum
>>
>> return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
>>
>>   File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 871, in
>>> fold
>>
>> vals = self.mapPartitions(func).collect()
>>
>>   File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 774, in
>>> collect
>>
>> return list(_load_from_socket(port, self._jrdd_deserializer))
>>
>>   File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 142, in
>>> _load_from_socket
>>
>> for item in serializer.load_stream(rf):
>>
>>   File "/root/spark/python/lib/pyspark.zip/pyspark/serializers.py", line
>>> 139, in load_stream
>>
>> yield self._read_with_length(stream)
>>
>>   File "/root/spark/python/lib/pyspark.zip/pyspark/serializers.py", line
>>> 156, in _read_with_length
>>
>> length = read_int(stream)
>>
>>   File "/root/spark/python/lib/pyspark.zip/pyspark/serializers.py", line
>>> 542, in read_int
>>
>> length = stream.read(4)
>>
>>   File "/usr/lib64/python2.6/socket.py", line 383, in read
>>
>> data = self._sock.recv(left)
>>
>> error: [Errno 104] Connection reset by peer
>>
>> 15/12/02 19:05:09 INFO scheduler.JobScheduler: Finished job streaming job
>>> 144908232 ms.1 from job set of time 144908232 ms
>>
>>
>>>
>>>
> And then when I try to access the web UI this error is thrown, leading me
> to believe that it has something to do with memory being full:
>
> 15/12/02 19:43:34 WARN servlet.ServletHandler: Error for /jobs/
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>   at java.lang.AbstractStringBuilder.(AbstractStringBuilder.java:64)
>>   at java.lang.StringBuilder.(StringBuilder.java:97)
>>   at scala.collection.mutable.StringBuilder.(StringBuilder.scala:46)
>>   at scala.collection.mutable.StringBuilder.(StringBuilder.scala:51)
>>   at scala.xml.Attribute$class.toString1(Attribute.scala:96)
>>   at scala.xml.UnprefixedAttribute.toString1(UnprefixedAttribute.scala:16)
>>   at scala.xml.MetaData.buildString(MetaData.scala:202)
>>   at scala.xml.Utility$.serialize(Utility.scala:216)
>>   at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
>>   at 

sparkavro for PySpark 1.3

2015-12-03 Thread YaoPau
How can I read from and write to Avro using PySpark in 1.3?  

I can only find the 1.4 documentation, which
uses a sqlContext.read method that isn't available to me in 1.3.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sparkavro-for-PySpark-1-3-tp25561.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming Specify Kafka Partition

2015-12-03 Thread Alan Braithwaite
One quick newbie question since I got another chance to look at this
today.  We're using java for our spark applications.  The createDirectStream
we were using previously [1] returns a JavaPairInputDStream, but the
createDirectStream with fromOffsets expects an argument recordClass to pass
into the generic constructor for createDirectStream.

In the code for the first function signature (without fromOffsets) it's
being constructed in Scala as just a tuple (K, V).   How do I pass this
same class/type information from java as the record class to get a
JavaPairInputDStream?

I understand this might be a question more fit for a scala mailing list but
google is failing me at the moment for hints on the interoperability of
scala and java generics.

[1] The original createDirectStream:
https://github.com/apache/spark/blob/branch-1.5/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala#L395-L423

Thanks,
- Alan

On Tue, Dec 1, 2015 at 8:12 AM, Cody Koeninger  wrote:

> I actually haven't tried that, since I tend to do the offset lookups if
> necessary.
>
> It's possible that it will work, try it and let me know.
>
> Be aware that if you're doing a count() or take() operation directly on
> the rdd it'll definitely give you the wrong result if you're using -1 for
> one of the offsets.
>
>
>
> On Tue, Dec 1, 2015 at 9:58 AM, Alan Braithwaite 
> wrote:
>
>> Neat, thanks.  If I specify something like -1 as the offset, will it
>> consume from the latest offset or do I have to instrument that manually?
>>
>> - Alan
>>
>> On Tue, Dec 1, 2015 at 6:43 AM, Cody Koeninger 
>> wrote:
>>
>>> Yes, there is a version of createDirectStream that lets you specify
>>> fromOffsets: Map[TopicAndPartition, Long]
>>>
>>> On Mon, Nov 30, 2015 at 7:43 PM, Alan Braithwaite 
>>> wrote:
>>>
 Is there any mechanism in the kafka streaming source to specify the
 exact partition id that we want a streaming job to consume from?

 If not, is there a workaround besides writing our a custom receiver?

 Thanks,
 - Alan

>>>
>>>
>>
>


Re: Problem with RDD of (Long, Byte[Array])

2015-12-03 Thread Josh Rosen
Are the keys that you're joining on the byte arrays themselves? If so,
that's not likely to work because of how Java computes arrays' hashCodes;
see https://issues.apache.org/jira/browse/SPARK-597. If this turns out to
be the problem, we should look into strengthening the checks for array-type
keys in order to detect and fail fast for this join() case.
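
A quick illustration of what I mean (Scala/Java arrays compare by reference
identity, so two arrays with identical contents won't collapse to the same key
during the join's shuffle):

val a = Array[Byte](1, 2, 3)
val b = Array[Byte](1, 2, 3)

a == b                     // false: arrays use reference equality
a.hashCode == b.hashCode   // false in practice: identity hash codes

// Wrapping the array gives value semantics, so keying on something like
// arr.toSeq (or another value-comparable form) is the usual workaround
// when the arrays themselves have to be the keys:
a.toSeq == b.toSeq         // true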

On Thu, Dec 3, 2015 at 8:58 AM, Hervé Yviquel  wrote:

> Hi all,
>
> I have a problem when using Array[Byte] in RDD operations.
> When I join two different RDDs of type [(Long, Array[Byte])], I obtain
> wrong results... But if I translate the byte arrays into integers and join two
> different RDDs of type [(Long, Integer)], then the results are correct...
> Any idea?
>
> --
> The code:
>
> val byteRDD0 = sc.binaryRecords(path_arg0, 4).zipWithIndex.map{x => (x._2,
> x._1)}
> val byteRDD1 = sc.binaryRecords(path_arg1, 4).zipWithIndex.map{x => (x._2,
> x._1)}
>
> byteRDD0.foreach{x => println("BYTE0 " + x._1 + "=> "
> +ByteBuffer.wrap(x._2).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt())}
> byteRDD1.foreach{x => println("BYTE1 " + x._1 + "=> "
> +ByteBuffer.wrap(x._2).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt())}
>
> val intRDD1 = byteRDD0.mapValues{x=>
> ByteBuffer.wrap(x).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt()}
> val intRDD2 = byteRDD1.mapValues{x=>
> ByteBuffer.wrap(x).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt()}
>
> val byteJOIN = byteRDD0.join(byteRDD1)
> byteJOIN.foreach{x => println("BYTEJOIN " + x._1 + "=> " +
> ByteBuffer.wrap(x._2._1).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt() +
> " -
> "+ByteBuffer.wrap(x._2._2).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt())}
>
> val intJOIN = intRDD1.join(intRDD2)
> intJOIN.foreach{x => println("INTJOIN " + x._1 + "=> " + x._2._1 + " - "+
> x._2._2)}
>
>
> --
> stdout:
>
> BYTE0 0=> 1
> BYTE0 1=> 3
> BYTE0 2=> 5
> BYTE0 3=> 7
> BYTE0 4=> 9
> BYTE0 5=> 11
> BYTE0 6=> 13
> BYTE0 7=> 15
> BYTE0 8=> 17
> BYTE0 9=> 19
> BYTE0 10=> 21
> BYTE0 11=> 23
> BYTE0 12=> 25
> BYTE0 13=> 27
> BYTE0 14=> 29
> BYTE0 15=> 31
> BYTE0 16=> 33
> BYTE0 17=> 35
> BYTE0 18=> 37
> BYTE0 19=> 39
> BYTE0 20=> 41
> BYTE0 21=> 43
> BYTE0 22=> 45
> BYTE0 23=> 47
> BYTE0 24=> 49
> BYTE0 25=> 51
> BYTE0 26=> 53
> BYTE0 27=> 55
> BYTE0 28=> 57
> BYTE0 29=> 59
> BYTE1 0=> 0
> BYTE1 1=> 1
> BYTE1 2=> 2
> BYTE1 3=> 3
> BYTE1 4=> 4
> BYTE1 5=> 5
> BYTE1 6=> 6
> BYTE1 7=> 7
> BYTE1 8=> 8
> BYTE1 9=> 9
> BYTE1 10=> 10
> BYTE1 11=> 11
> BYTE1 12=> 12
> BYTE1 13=> 13
> BYTE1 14=> 14
> BYTE1 15=> 15
> BYTE1 16=> 16
> BYTE1 17=> 17
> BYTE1 18=> 18
> BYTE1 19=> 19
> BYTE1 20=> 20
> BYTE1 21=> 21
> BYTE1 22=> 22
> BYTE1 23=> 23
> BYTE1 24=> 24
> BYTE1 25=> 25
> BYTE1 26=> 26
> BYTE1 27=> 27
> BYTE1 28=> 28
> BYTE1 29=> 29
> BYTEJOIN 13=> 1 - 0
> BYTEJOIN 19=> 1 - 0
> BYTEJOIN 15=> 1 - 0
> BYTEJOIN 4=> 1 - 0
> BYTEJOIN 21=> 1 - 0
> BYTEJOIN 16=> 1 - 0
> BYTEJOIN 22=> 1 - 0
> BYTEJOIN 25=> 1 - 0
> BYTEJOIN 28=> 1 - 0
> BYTEJOIN 29=> 1 - 0
> BYTEJOIN 11=> 1 - 0
> BYTEJOIN 14=> 1 - 0
> BYTEJOIN 27=> 1 - 0
> BYTEJOIN 0=> 1 - 0
> BYTEJOIN 24=> 1 - 0
> BYTEJOIN 23=> 1 - 0
> BYTEJOIN 1=> 1 - 0
> BYTEJOIN 6=> 1 - 0
> BYTEJOIN 17=> 1 - 0
> BYTEJOIN 3=> 1 - 0
> BYTEJOIN 7=> 1 - 0
> BYTEJOIN 9=> 1 - 0
> BYTEJOIN 8=> 1 - 0
> BYTEJOIN 12=> 1 - 0
> BYTEJOIN 18=> 1 - 0
> BYTEJOIN 20=> 1 - 0
> BYTEJOIN 26=> 1 - 0
> BYTEJOIN 10=> 1 - 0
> BYTEJOIN 5=> 1 - 0
> BYTEJOIN 2=> 1 - 0
> INTJOIN 13=> 27 - 13
> INTJOIN 19=> 39 - 19
> INTJOIN 15=> 31 - 15
> INTJOIN 4=> 9 - 4
> INTJOIN 21=> 43 - 21
> INTJOIN 16=> 33 - 16
> INTJOIN 22=> 45 - 22
> INTJOIN 25=> 51 - 25
> INTJOIN 28=> 57 - 28
> INTJOIN 29=> 59 - 29
> INTJOIN 11=> 23 - 11
> INTJOIN 14=> 29 - 14
> INTJOIN 27=> 55 - 27
> INTJOIN 0=> 1 - 0
> INTJOIN 24=> 49 - 24
> INTJOIN 23=> 47 - 23
> INTJOIN 1=> 3 - 1
> INTJOIN 6=> 13 - 6
> INTJOIN 17=> 35 - 17
> INTJOIN 3=> 7 - 3
> INTJOIN 7=> 15 - 7
> INTJOIN 9=> 19 - 9
> INTJOIN 8=> 17 - 8
> INTJOIN 12=> 25 - 12
> INTJOIN 18=> 37 - 18
> INTJOIN 20=> 41 - 20
> INTJOIN 26=> 53 - 26
> INTJOIN 10=> 21 - 10
> INTJOIN 5=> 11 - 5
> INTJOIN 2=> 5 - 2
>
>
> Thanks,
> Hervé
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Any clue on this error, Exception in thread "main" java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

2015-12-03 Thread Marcelo Vanzin
(bcc: user@spark, since this is Hive code.)

You're probably including unneeded Spark jars in Hive's classpath
somehow. Either the whole assembly or spark-hive, both of which will
contain Hive classes, and in this case contain old versions that
conflict with the version of Hive you're running.

On Thu, Dec 3, 2015 at 9:54 AM, Mich Talebzadeh  wrote:
> Trying to run Hive on Spark 1.3 engine, I get
>
>
>
> conf hive.spark.client.channel.log.level=null --conf
> hive.spark.client.rpc.max.size=52428800 --conf
> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
>
> 15/12/03 17:53:18 [stderr-redir-1]: INFO client.SparkClientImpl: Spark
> assembly has been built with Hive, including Datanucleus jars on classpath
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.connect.timeout=1000
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.rpc.threads=8
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.rpc.max.size=52428800
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.secret.bits=256
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property:
> hive.spark.client.server.connect.timeout=9
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: 15/12/03
> 17:53:19 INFO client.RemoteDriver: Connecting to: rhes564:36577
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Exception
> in thread "main" java.lang.NoSuchFieldError:
> SPARK_RPC_CLIENT_CONNECT_TIMEOUT
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.hive.spark.client.rpc.RpcConfiguration.<init>(RpcConfiguration.java:46)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:146)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> java.lang.reflect.Method.invoke(Method.java:606)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
>
> Any clues?
>
>
>
>
>
> Mich Talebzadeh
>
>
>
> Sybase ASE 15 Gold Medal Award 2008
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
> ISBN 978-0-9563693-0-7.
>
> co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4
>
> Publications due shortly:
>
> Complex Event Processing in Heterogeneous Environments, ISBN:
> 978-0-9563693-3-8
>
> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus free,
> therefore neither Peridale Ltd, its subsidiaries nor their employees accept
> any responsibility.
>
>



-- 
Marcelo


RE: Any clue on this error, Exception in thread "main" java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

2015-12-03 Thread Mich Talebzadeh
Thanks I tried all :(

 

I am trying to make Hive use Spark, and apparently Hive can use version 1.3 of 
Spark as its execution engine. Frankly, I don't know why this is not working!

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Furcy Pin [mailto:furcy@flaminem.com] 
Sent: 03 December 2015 18:07
To: u...@hive.apache.org
Cc: user@spark.apache.org
Subject: Re: Any clue on this error, Exception in thread "main" 
java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

 

maybe you compile and run against different versions of spark?

 

On Thu, Dec 3, 2015 at 6:54 PM, Mich Talebzadeh wrote:

Trying to run Hive on Spark 1.3 engine, I get

 

conf hive.spark.client.channel.log.level=null --conf 
hive.spark.client.rpc.max.size=52428800 --conf hive.spark.client.rpc.threads=8 
--conf hive.spark.client.secret.bits=256

15/12/03 17:53:18 [stderr-redir-1]: INFO client.SparkClientImpl: Spark assembly 
has been built with Hive, including Datanucleus jars on classpath

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning: 
Ignoring non-spark config property: hive.spark.client.connect.timeout=1000

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning: 
Ignoring non-spark config property: hive.spark.client.rpc.threads=8

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning: 
Ignoring non-spark config property: hive.spark.client.rpc.max.size=52428800

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning: 
Ignoring non-spark config property: hive.spark.client.secret.bits=256

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning: 
Ignoring non-spark config property: 
hive.spark.client.server.connect.timeout=9

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: 15/12/03 
17:53:19 INFO client.RemoteDriver: Connecting to: rhes564:36577

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Exception in 
thread "main" java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at 
org.apache.hive.spark.client.rpc.RpcConfiguration.<init>(RpcConfiguration.java:46)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at 
org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:146)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at 
org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at 
java.lang.reflect.Method.invoke(Method.java:606)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at 

how to spark streaming application start working on next batch before completing on previous batch .

2015-12-03 Thread prateek arora
Hi

I am using Spark Streaming with Kafka.  The Spark version is 1.5.0 and the batch
interval is 1 sec.

In my scenario, the algorithm takes 7-10 sec to process one batch interval of data,
so Spark Streaming only starts processing the next batch after completing the
previous one.

I want my Spark Streaming application to start working on the next batch
before the previous batch has completed, i.e. batches should execute in
parallel.

Please help me solve this problem.

Regards
Prateek



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-spark-streaming-application-start-working-on-next-batch-before-completing-on-previous-batch-tp25559.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
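
One hedged pointer, based on an undocumented setting rather than anything stated
in this thread: Spark Streaming's job scheduler can be allowed to run more than one
batch job at a time via spark.streaming.concurrentJobs. A minimal sketch, assuming
Spark 1.5 and a placeholder application name:

import org.apache.spark.SparkConf

// Sketch only: spark.streaming.concurrentJobs is undocumented and its semantics may change
val conf = new SparkConf()
  .setAppName("parallel-batches")              // placeholder name
  .set("spark.streaming.concurrentJobs", "2")  // allow two batch jobs to run concurrently

Note that with stateful operations or any ordering assumptions, overlapping batches
can produce surprising results.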



jdbc error, ClassNotFoundException: org.apache.hadoop.hive.schshim.FairSchedulerShim

2015-12-03 Thread zhangjp
Hi all,
  I downloaded the prebuilt version 1.5.2 with Hadoop 2.6. When I use spark-sql 
there is no problem, but when I start the Thrift server and then try to query a Hive 
table using JDBC, I get the following errors.
  
 Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hive.schshim.FairSchedulerShim
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:195)
at 
org.apache.hadoop.hive.shims.ShimLoader.createShim(ShimLoader.java:146)

Re: SparkR in Spark 1.5.2 jsonFile Bug Found

2015-12-03 Thread Felix Cheung
It looks like this has been broken around Spark 1.5.
Please see JIRA SPARK-10185. This has been fixed in pyspark but unfortunately 
SparkR was missed. I have confirmed this is still broken in Spark 1.6.
Could you please open a JIRA?






On Thu, Dec 3, 2015 at 2:08 PM -0800, "tomasr3" 
 wrote:





Hello,

I believe I have encountered a bug in Spark 1.5.2. I am using RStudio and
SparkR to read in JSON files with jsonFile(sqlContext, "path"). If "path" is
a single path (e.g., "/path/to/dir0"), then it works fine; but when "path" is a
vector of paths (e.g., path <- c("/path/to/dir1","/path/to/dir2")), then I get
the following error message:

> raw.terror<-jsonFile(sqlContext,path)
15/12/03 15:59:55 ERROR RBackendHandler: jsonFile on 1 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  java.io.IOException: No input paths specified in job
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2

Note that passing a vector of paths works just fine in Spark 1.4.1. Any help
is greatly appreciated if this is not a bug but rather an environment or
other issue.

Best,
T



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-in-Spark-1-5-2-jsonFile-Bug-Found-tp25560.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark master - run-tests error

2015-12-03 Thread wei....@kaiyuandao.com

Hi, does anyone know why I get the following error when running 
tests after a successful full build? Thanks

[root@sandbox spark_git]# dev/run-tests
**
File "./dev/run-tests.py", line 68, in 
__main__.identify_changed_files_from_git_commits
Failed example:
[x.name for x in determine_modules_for_files( 
identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
Exception raised:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/doctest.py", line 1253, in __run
compileflags, 1) in test.globs
  File "", 
line 1, in 
[x.name for x in determine_modules_for_files( 
identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
  File "./dev/run-tests.py", line 84, in 
identify_changed_files_from_git_commits
raw_output = subprocess.check_output(['git', 'diff', '--name-only', 
patch_sha, diff_target],
AttributeError: 'module' object has no attribute 'check_output'
**
File "./dev/run-tests.py", line 70, in 
__main__.identify_changed_files_from_git_commits
Failed example:
'root' in [x.name for x in determine_modules_for_files(  
identify_changed_files_from_git_commits("50a0496a43", target_ref="6765ef9"))]
Exception raised:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/doctest.py", line 1253, in __run
compileflags, 1) in test.globs
  File "", 
line 1, in 
'root' in [x.name for x in determine_modules_for_files(  
identify_changed_files_from_git_commits("50a0496a43", target_ref="6765ef9"))]



Re: spark master - run-tests error

2015-12-03 Thread Ted Yu
The commit on last line led to:

commit 50a0496a43f09d70593419efc38587c8441843bf
Author: Brennon York 
Date:   Wed Jun 17 12:00:34 2015 -0700

When did you last update your workspace ?

Cheers

On Thu, Dec 3, 2015 at 6:09 PM, wei@kaiyuandao.com <
wei@kaiyuandao.com> wrote:

>
> hi, is there anyone knowing why I came to the following error when running
> tests after a successful full build? thanks
>
> [root@sandbox spark_git]# dev/run-tests
> **
>
> File "./dev/run-tests.py", line 68, in 
> __main__.identify_changed_files_from_git_commits
> Failed example:
> [x.name
>  for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib64/python2.6/doctest.py", line 1253, in __run
> compileflags, 1) in test.globs
>
>   File "", 
> line 1, in 
> [x.name
>  for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
>
>   File "./dev/run-tests.py", line 84, in 
> identify_changed_files_from_git_commits
>
> raw_output = subprocess.check_output(['git', 'diff', '--name-only', 
> patch_sha, diff_target],
> AttributeError: 'module' object has no attribute 'check_output'
> **
>
> File "./dev/run-tests.py", line 70, in 
> __main__.identify_changed_files_from_git_commits
> Failed example:
> 'root' in [x.name
>  for x in determine_modules_for_files(  
> identify_changed_files_from_git_commits("50a0496a43", target_ref="6765ef9"))]
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib64/python2.6/doctest.py", line 1253, in __run
> compileflags, 1) in test.globs
>
>   File "", 
> line 1, in 
> 'root' in [x.name
>  for x in determine_modules_for_files(  
> identify_changed_files_from_git_commits("50a0496a43", target_ref="6765ef9"))]
>
>


Re: Re: spark master - run-tests error

2015-12-03 Thread wei....@kaiyuandao.com
I was using the latest master branch. 



 
From: Ted Yu
Date: 2015-12-04 10:14
To: wei@kaiyuandao.com
CC: user
Subject: Re: spark master - run-tests error
The commit on last line led to:

commit 50a0496a43f09d70593419efc38587c8441843bf
Author: Brennon York 
Date:   Wed Jun 17 12:00:34 2015 -0700

When did you last update your workspace ?

Cheers

On Thu, Dec 3, 2015 at 6:09 PM, wei@kaiyuandao.com  
wrote:

hi, is there anyone knowing why I came to the following error when running 
tests after a successful full build? thanks

[root@sandbox spark_git]# dev/run-tests
**
File "./dev/run-tests.py", line 68, in 
__main__.identify_changed_files_from_git_commits
Failed example:
[x.name for x in determine_modules_for_files( 
identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
Exception raised:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/doctest.py", line 1253, in __run
compileflags, 1) in test.globs
  File "", 
line 1, in 
[x.name for x in determine_modules_for_files( 
identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
  File "./dev/run-tests.py", line 84, in 
identify_changed_files_from_git_commits
raw_output = subprocess.check_output(['git', 'diff', '--name-only', 
patch_sha, diff_target],
AttributeError: 'module' object has no attribute 'check_output'
**
File "./dev/run-tests.py", line 70, in 
__main__.identify_changed_files_from_git_commits
Failed example:
'root' in [x.name for x in determine_modules_for_files(  
identify_changed_files_from_git_commits("50a0496a43", target_ref="6765ef9"))]
Exception raised:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/doctest.py", line 1253, in __run
compileflags, 1) in test.globs
  File "", 
line 1, in 
'root' in [x.name for x in determine_modules_for_files(  
identify_changed_files_from_git_commits("50a0496a43", target_ref="6765ef9"))]




Re: spark master - run-tests error

2015-12-03 Thread Ted Yu
From dev/run-tests.py:

def identify_changed_files_from_git_commits(patch_sha, target_branch=None,
target_ref=None):
"""
Given a git commit and target ref, use the set of files changed in the
diff in order to
determine which modules' tests should be run.

Looks like the script needs the git commit to determine which modules'
test suites should be run.

I use the following command to run tests:

mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4 -Dhadoop.version=2.7.0
package

FYI

On Thu, Dec 3, 2015 at 6:09 PM, wei@kaiyuandao.com <
wei@kaiyuandao.com> wrote:

>
> hi, is there anyone knowing why I came to the following error when running
> tests after a successful full build? thanks
>
> [root@sandbox spark_git]# dev/run-tests
> **
>
> File "./dev/run-tests.py", line 68, in 
> __main__.identify_changed_files_from_git_commits
> Failed example:
> [x.name
>  for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib64/python2.6/doctest.py", line 1253, in __run
> compileflags, 1) in test.globs
>
>   File "", 
> line 1, in 
> [x.name
>  for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
>
>   File "./dev/run-tests.py", line 84, in 
> identify_changed_files_from_git_commits
>
> raw_output = subprocess.check_output(['git', 'diff', '--name-only', 
> patch_sha, diff_target],
> AttributeError: 'module' object has no attribute 'check_output'
> **
>
> File "./dev/run-tests.py", line 70, in 
> __main__.identify_changed_files_from_git_commits
> Failed example:
> 'root' in [x.name
>  for x in determine_modules_for_files(  
> identify_changed_files_from_git_commits("50a0496a43", target_ref="6765ef9"))]
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib64/python2.6/doctest.py", line 1253, in __run
> compileflags, 1) in test.globs
>
>   File "", 
> line 1, in 
> 'root' in [x.name
>  for x in determine_modules_for_files(  
> identify_changed_files_from_git_commits("50a0496a43", target_ref="6765ef9"))]
>
>


Spark Streaming Shuffle to Disk

2015-12-03 Thread Steven Pearson
I'm running a Spark Streaming job on 1.3.1 which contains an
updateStateByKey.  The job works perfectly fine, but at some point (after a
few runs), it starts shuffling to disk no matter how much memory I give the
executors.

I have tried changing --executor-memory on
spark-submit, spark.shuffle.memoryFraction, spark.storage.memoryFraction,
and spark.storage.unrollFraction.  But no matter how I configure these, it
always spills to disk around 2.5GB.

What is the best way to avoid spilling shuffle to disk?


Creating a dataframe with decimals changes the precision and scale

2015-12-03 Thread Philip Dodds
I'm not sure if there is a way around this; just looking for advice.

I create a dataframe from some decimals with a specific precision and
scale, but when I look at the dataframe the precision and scale have
defaulted back again.

Is there a way to retain the precision and scale when doing a toDF()?

example code:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.types.Decimal

val a = new Decimal().set(BigDecimal(50),14,4)
val b = new Decimal().set(BigDecimal(50),14,4)

val data = Seq.fill(5) {
 (a,b)
   }

val trades = data.toDF()

trades.printSchema()

the result of this code would show

root
 |-- _1: decimal(38,18) (nullable = true)
 |-- _2: decimal(38,18) (nullable = true)

sqlContext: org.apache.spark.sql.SQLContext =
org.apache.spark.sql.SQLContext@3a1a48f7
import sqlContext.implicits._
import org.apache.spark.sql.types.Decimal
a: org.apache.spark.sql.types.Decimal = 50.00
b: org.apache.spark.sql.types.Decimal = 50.00
data: Seq[(org.apache.spark.sql.types.Decimal,
org.apache.spark.sql.types.Decimal)] =
List((50.00,50.00),
(50.00,50.00),
(50.00,50.00),
(50.00,50.00),
(50.00,50.00))
trades: org.apache.spark.sql.DataFrame = [_1: decimal(38,18), _2:
decimal(38,18)]


Any advice would be brilliant


Thanks

P


-- 
Philip Dodds

philip.do...@gmail.com
@philipdodds


Re: Column Aliases are Ignored in callUDF while using struct()

2015-12-03 Thread Sachin Aggarwal
Hi,

Has anyone faced this error? Is there any workaround for this issue?

thanks

On Thu, Dec 3, 2015 at 4:28 PM, Sahil Sareen  wrote:

> Attaching the JIRA as well for completeness:
> https://issues.apache.org/jira/browse/SPARK-12117
>
> On Thu, Dec 3, 2015 at 4:13 PM, Sachin Aggarwal <
> different.sac...@gmail.com> wrote:
>
>>
>> Hi All,
>>
>> need help guys, I need a work around for this situation
>>
>> *case where this works:*
>>
>> val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"),
>> ("Rishabh", "2"))).toDF("myText", "id")
>>
>> TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
>>
>>
>> Steps to reproduce the error case:
>> 1) create a file containing the following text -- filename: a.json
>>
>> { "myText": "Sachin Aggarwal", "id": "1"}
>> { "myText": "Rishabh", "id": "2"}
>>
>> 2) define a simple UDF
>> def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}
>>
>> 3) register the udf
>>  sqlContext.udf.register("mydef" ,mydef _)
>>
>> 4) read the input file
>> val TestDoc2=sqlContext.read.json("/tmp/a.json")
>>
>> 5) make a call to UDF
>>
>> TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
>>
>> ERROR received:
>> java.lang.IllegalArgumentException: Field "Text" does not exist.
>>  at
>> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>>  at
>> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>>  at scala.collection.AbstractMap.getOrElse(Map.scala:58)
>>  at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233)
>>  at
>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212)
>>  at org.apache.spark.sql.Row$class.getAs(Row.scala:325)
>>  at
>> org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
>>  at
>> $line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107)
>>  at
>> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>>  at
>> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>>  at
>> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
>>  at
>> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
>>  at
>> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
>>  at
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>> Source)
>>  at
>> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
>>  at
>> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
>>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>  at
>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>  at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>  at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>>  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>>  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>>  at
>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>>  at
>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>>  at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>>  at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>>  at
>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>>  at
>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>  at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>  at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1177)
>>  at
>> 
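
One possible workaround sketch while the alias issue is open, assuming the struct
is built in the column order shown above: access the field by position (or by the
original, pre-alias column name) instead of by the alias:

import org.apache.spark.sql.Row

// Hypothetical variant of mydef: the first struct field is the text column
def mydef(r: Row): String = r.getString(0)
// or, possibly: r.getAs[String]("myText") -- assuming the struct keeps the original column name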

Re: Problem with RDD of (Long, Byte[Array])

2015-12-03 Thread Hervé Yviquel
Hi Josh,

Thanks for the answer.
No, in my case the byte arrays are the values... I use the indexes generated by 
zipWithIndex as the keys (I swap the pairs to put the index in front).
However, if I clone the byte arrays before joining the RDDs, it seems to fix my 
problem (but I'm not sure why).

-- R.V
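
A sketch of that clone workaround, reusing the names from the quoted code below.
The explanation here is an assumption rather than something verified in this thread:
Hadoop-style fixed-length record readers may reuse a single byte buffer across
records, so copying each record before it is paired and shuffled keeps the elements
independent.

val byteRDD0 = sc.binaryRecords(path_arg0, 4)
  .map(_.clone())                            // defensive copy in case the reader reuses its buffer
  .zipWithIndex
  .map { case (bytes, idx) => (idx, bytes) }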



> On Dec 3, 2015, at 3:51 PM, Josh Rosen wrote:
> 
> Are the keys that you're joining on the byte arrays themselves? If so, that's 
> not likely to work because of how Java computes arrays' hashCodes; see 
> https://issues.apache.org/jira/browse/SPARK-597 
> . If this turns out to be 
> the problem, we should look into strengthening the checks for array-type keys 
> in order to detect and fail fast for this join() case.
> 
On Thu, Dec 3, 2015 at 8:58 AM, Hervé Yviquel wrote:
> Hi all,
> 
> I have a problem when using Array[Byte] in RDD operations.
> When I join two different RDDs of type [(Long, Array[Byte])], I obtain wrong 
> results... But if I translate the byte arrays into integers and join two 
> different RDDs of type [(Long, Integer)], then the results are correct... Any 
> idea?
> 
> --
> The code:
> 
> val byteRDD0 = sc.binaryRecords(path_arg0, 4).zipWithIndex.map{x => (x._2, 
> x._1)}
> val byteRDD1 = sc.binaryRecords(path_arg1, 4).zipWithIndex.map{x => (x._2, 
> x._1)}
> 
> byteRDD0.foreach{x => println("BYTE0 " + x._1 + "=> " 
> +ByteBuffer.wrap(x._2).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt())}
> byteRDD1.foreach{x => println("BYTE1 " + x._1 + "=> " 
> +ByteBuffer.wrap(x._2).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt())}
> 
> val intRDD1 = byteRDD0.mapValues{x=> 
> ByteBuffer.wrap(x).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt()}
> val intRDD2 = byteRDD1.mapValues{x=> 
> ByteBuffer.wrap(x).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt()}
> 
> val byteJOIN = byteRDD0.join(byteRDD1)
> byteJOIN.foreach{x => println("BYTEJOIN " + x._1 + "=> " + 
> ByteBuffer.wrap(x._2._1).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt() + " 
> - 
> "+ByteBuffer.wrap(x._2._2).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt())}
> 
> val intJOIN = intRDD1.join(intRDD2)
> intJOIN.foreach{x => println("INTJOIN " + x._1 + "=> " + x._2._1 + " - "+ 
> x._2._2)}
> 
> 
> --
> stdout:
> 
> BYTE0 0=> 1
> BYTE0 1=> 3
> BYTE0 2=> 5
> BYTE0 3=> 7
> BYTE0 4=> 9
> BYTE0 5=> 11
> BYTE0 6=> 13
> BYTE0 7=> 15
> BYTE0 8=> 17
> BYTE0 9=> 19
> BYTE0 10=> 21
> BYTE0 11=> 23
> BYTE0 12=> 25
> BYTE0 13=> 27
> BYTE0 14=> 29
> BYTE0 15=> 31
> BYTE0 16=> 33
> BYTE0 17=> 35
> BYTE0 18=> 37
> BYTE0 19=> 39
> BYTE0 20=> 41
> BYTE0 21=> 43
> BYTE0 22=> 45
> BYTE0 23=> 47
> BYTE0 24=> 49
> BYTE0 25=> 51
> BYTE0 26=> 53
> BYTE0 27=> 55
> BYTE0 28=> 57
> BYTE0 29=> 59
> BYTE1 0=> 0
> BYTE1 1=> 1
> BYTE1 2=> 2
> BYTE1 3=> 3
> BYTE1 4=> 4
> BYTE1 5=> 5
> BYTE1 6=> 6
> BYTE1 7=> 7
> BYTE1 8=> 8
> BYTE1 9=> 9
> BYTE1 10=> 10
> BYTE1 11=> 11
> BYTE1 12=> 12
> BYTE1 13=> 13
> BYTE1 14=> 14
> BYTE1 15=> 15
> BYTE1 16=> 16
> BYTE1 17=> 17
> BYTE1 18=> 18
> BYTE1 19=> 19
> BYTE1 20=> 20
> BYTE1 21=> 21
> BYTE1 22=> 22
> BYTE1 23=> 23
> BYTE1 24=> 24
> BYTE1 25=> 25
> BYTE1 26=> 26
> BYTE1 27=> 27
> BYTE1 28=> 28
> BYTE1 29=> 29
> BYTEJOIN 13=> 1 - 0
> BYTEJOIN 19=> 1 - 0
> BYTEJOIN 15=> 1 - 0
> BYTEJOIN 4=> 1 - 0
> BYTEJOIN 21=> 1 - 0
> BYTEJOIN 16=> 1 - 0
> BYTEJOIN 22=> 1 - 0
> BYTEJOIN 25=> 1 - 0
> BYTEJOIN 28=> 1 - 0
> BYTEJOIN 29=> 1 - 0
> BYTEJOIN 11=> 1 - 0
> BYTEJOIN 14=> 1 - 0
> BYTEJOIN 27=> 1 - 0
> BYTEJOIN 0=> 1 - 0
> BYTEJOIN 24=> 1 - 0
> BYTEJOIN 23=> 1 - 0
> BYTEJOIN 1=> 1 - 0
> BYTEJOIN 6=> 1 - 0
> BYTEJOIN 17=> 1 - 0
> BYTEJOIN 3=> 1 - 0
> BYTEJOIN 7=> 1 - 0
> BYTEJOIN 9=> 1 - 0
> BYTEJOIN 8=> 1 - 0
> BYTEJOIN 12=> 1 - 0
> BYTEJOIN 18=> 1 - 0
> BYTEJOIN 20=> 1 - 0
> BYTEJOIN 26=> 1 - 0
> BYTEJOIN 10=> 1 - 0
> BYTEJOIN 5=> 1 - 0
> BYTEJOIN 2=> 1 - 0
> INTJOIN 13=> 27 - 13
> INTJOIN 19=> 39 - 19
> INTJOIN 15=> 31 - 15
> INTJOIN 4=> 9 - 4
> INTJOIN 21=> 43 - 21
> INTJOIN 16=> 33 - 16
> INTJOIN 22=> 45 - 22
> INTJOIN 25=> 51 - 25
> INTJOIN 28=> 57 - 28
> INTJOIN 29=> 59 - 29
> INTJOIN 11=> 23 - 11
> INTJOIN 14=> 29 - 14
> INTJOIN 27=> 55 - 27
> INTJOIN 0=> 1 - 0
> INTJOIN 24=> 49 - 24
> INTJOIN 23=> 47 - 23
> INTJOIN 1=> 3 - 1
> INTJOIN 6=> 13 - 6
> INTJOIN 17=> 35 - 17
> INTJOIN 3=> 7 - 3
> INTJOIN 7=> 15 - 7
> INTJOIN 9=> 19 - 9
> INTJOIN 8=> 17 - 8
> INTJOIN 12=> 25 - 12
> INTJOIN 18=> 37 - 18
> INTJOIN 20=> 41 - 20
> INTJOIN 26=> 53 - 26
> INTJOIN 10=> 21 - 10
> INTJOIN 5=> 11 - 5
> INTJOIN 2=> 5 - 2
> 
> 
> Thanks,
> Hervé
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 



Re: Creating a dataframe with decimals changes the precision and scale

2015-12-03 Thread Ted Yu
Looks like what you observed is due to the following code in Decimal.scala :

  def set(decimal: BigDecimal, precision: Int, scale: Int): Decimal = {
this.decimalVal = decimal.setScale(scale, ROUND_HALF_UP)
require(
  decimalVal.precision <= precision,
  s"Decimal precision ${decimalVal.precision} exceeds max precision
$precision")

You can construct BigDecimal, use the following method on the instance:
http://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html#setScale(int,%20java.math.RoundingMode)

and pass to set(decimal: BigDecimal)

FYI
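
As an alternative sketch (a different technique from the Decimal.set route above),
the column types can also be pinned after the fact with an explicit cast, assuming
the Spark 1.5 DataFrame API and the names from the quoted snippet below:

import org.apache.spark.sql.types.DecimalType

// Hypothetical follow-up to the quoted example: force both tuple columns to decimal(14,4)
val trades14_4 = data.toDF()
  .select($"_1".cast(DecimalType(14, 4)).as("_1"),
          $"_2".cast(DecimalType(14, 4)).as("_2"))

trades14_4.printSchema()
// root
//  |-- _1: decimal(14,4) (nullable = true)
//  |-- _2: decimal(14,4) (nullable = true)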

On Thu, Dec 3, 2015 at 9:46 AM, Philip Dodds  wrote:

> I'm not sure if there is a way around this just looking for advice,
>
> I create a dataframe from some decimals with a specific precision and
> scale,  then when I look at the dataframe it has defaulted the precision
> and scale back again.
>
> Is there a way to retain the precision and scale when doing a toDF()
>
> example code:
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> import org.apache.spark.sql.types.Decimal
>
> val a = new Decimal().set(BigDecimal(50),14,4)
> val b = new Decimal().set(BigDecimal(50),14,4)
>
> val data = Seq.fill(5) {
>  (a,b)
>}
>
> val trades = data.toDF()
>
> trades.printSchema()
>
> the result of this code would show
>
> root
>  |-- _1: decimal(38,18) (nullable = true)
>  |-- _2: decimal(38,18) (nullable = true)
>
> sqlContext: org.apache.spark.sql.SQLContext = 
> org.apache.spark.sql.SQLContext@3a1a48f7
> import sqlContext.implicits._
> import org.apache.spark.sql.types.Decimal
> a: org.apache.spark.sql.types.Decimal = 50.00
> b: org.apache.spark.sql.types.Decimal = 50.00
> data: Seq[(org.apache.spark.sql.types.Decimal, 
> org.apache.spark.sql.types.Decimal)] = 
> List((50.00,50.00), 
> (50.00,50.00), 
> (50.00,50.00), 
> (50.00,50.00), 
> (50.00,50.00))
> trades: org.apache.spark.sql.DataFrame = [_1: decimal(38,18), _2: 
> decimal(38,18)]
>
>
> Any advice would be brilliant
>
>
> Thanks
>
> P
>
>
> --
> Philip Dodds
>
> philip.do...@gmail.com
> @philipdodds
>
>


Any clue on this error, Exception in thread "main" java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

2015-12-03 Thread Mich Talebzadeh
Trying to run Hive on Spark 1.3 engine, I get

 

conf hive.spark.client.channel.log.level=null --conf
hive.spark.client.rpc.max.size=52428800 --conf
hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256

15/12/03 17:53:18 [stderr-redir-1]: INFO client.SparkClientImpl: Spark
assembly has been built with Hive, including Datanucleus jars on classpath

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
Ignoring non-spark config property: hive.spark.client.connect.timeout=1000

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
Ignoring non-spark config property: hive.spark.client.rpc.threads=8

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
Ignoring non-spark config property: hive.spark.client.rpc.max.size=52428800

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
Ignoring non-spark config property: hive.spark.client.secret.bits=256

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
Ignoring non-spark config property:
hive.spark.client.server.connect.timeout=9

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: 15/12/03
17:53:19 INFO client.RemoteDriver: Connecting to: rhes564:36577

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Exception
in thread "main" java.lang.NoSuchFieldError:
SPARK_RPC_CLIENT_CONNECT_TIMEOUT

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
org.apache.hive.spark.client.rpc.RpcConfiguration.<init>(RpcConfiguration.java:46)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:146)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57
)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
.java:43)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
java.lang.reflect.Method.invoke(Method.java:606)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$ru
nMain(SparkSubmit.scala:569)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)

15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 

Any clues?

 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
pdf

Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 

  http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.

 



Re: Local mode: Stages hang for minutes

2015-12-03 Thread Ali Tajeldin EDU
You can try running "jstack" a couple of times while the app is hung to look for 
patterns in where it is blocked.
--
Ali

On Dec 3, 2015, at 8:27 AM, Richard Marscher  wrote:

> I should add that the pauses are not from GC and also in tracing the CPU call 
> tree in the JVM it seems like nothing is doing any work, just seems to be 
> idling or blocking.
> 
> On Thu, Dec 3, 2015 at 11:24 AM, Richard Marscher  
> wrote:
> Hi,
> 
> I'm doing some testing of workloads using local mode on a server. I see weird 
> behavior where a job is submitted to the application and it just hangs for 
> several minutes doing nothing. The stages are submitted as pending and in the 
> application UI the stage view claims no tasks have been submitted. Suddenly 
> after a few minutes things suddenly start and run smoothly. 
> 
> I'm running against tiny data sets the size of 10s to low 100s of items in 
> the RDD. I've been attaching with JProfiler, doing thread and heap dumps but 
> nothing is really standing out as to why Spark seems to periodically pause 
> for such a long time.
> 
> Has anyone else seen similar behavior or aware of some quirk of local mode 
> that could cause this kind of blocking?
> 
> -- 
> Richard Marscher
> Software Engineer
> Localytics
> Localytics.com | Our Blog | Twitter | Facebook | LinkedIn
> 
> 
> 
> -- 
> Richard Marscher
> Software Engineer
> Localytics
> Localytics.com | Our Blog | Twitter | Facebook | LinkedIn



Re: Any clue on this error, Exception in thread "main" java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

2015-12-03 Thread Marcelo Vanzin
On Thu, Dec 3, 2015 at 10:32 AM, Mich Talebzadeh  wrote:

> hduser@rhes564::/usr/lib/spark/logs> hive --version
> SLF4J: Found binding in
> [jar:file:/usr/lib/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

As I suggested before, you have Spark's assembly in the Hive
classpath. That's not the way to configure hive-on-spark; if the
documentation you're following tells you to do that, it's wrong.

(And sorry Ted, but please ignore Ted's suggestion. Hive-on-Spark
should work fine with Spark 1.3 if it's configured correctly. You
really don't want to be overriding Hive classes with the ones shipped
in the Spark assembly, regardless of the version of Spark being used.)

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org