Re: Running in cluster mode causes native library linking to fail

2015-10-13 Thread Deenar Toraskar
Hi Bernardo

Is the native library installed on all machines of your cluster and are you
setting both the spark.driver.extraLibraryPath and
spark.executor.extraLibraryPath ?
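
For example, something along these lines when submitting (just a sketch; the
library directory, class name and jar are placeholders, not taken from your setup):

spark-submit \
  --conf spark.driver.extraLibraryPath=/opt/native/lib \
  --conf spark.executor.extraLibraryPath=/opt/native/lib \
  --class com.example.YourMain your-app.jar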

Deenar



On 14 October 2015 at 05:44, Bernardo Vecchia Stein <
bernardovst...@gmail.com> wrote:

> Hello,
>
> I am trying to run some scala code in cluster mode using spark-submit.
> This code uses addLibrary to link with a .so that exists in the machine,
> and this library has a function to be called natively (there's a native
> definition as needed in the code).
>
> The problem I'm facing is: whenever I try to run this code in cluster
> mode, spark fails with the following message when trying to execute the
> native function:
> java.lang.UnsatisfiedLinkError:
> org.name.othername.ClassName.nativeMethod([B[B)[B
>
> Apparently, the library is being found by spark, but the required function
> isn't found.
>
> When trying to run in client mode, however, this doesn't fail and
> everything works as expected.
>
> Does anybody have any idea of what might be the problem here? Is there any
> bug that could be related to this when running in cluster mode?
>
> I appreciate any help.
> Thanks,
> Bernardo
>


Re: When does python program started in pyspark

2015-10-13 Thread canan chen
I think PythonRunner is launched when executing a Python script, while
PythonGatewayServer is the entry point for the PySpark shell:


if (args.isPython && deployMode == CLIENT) {
  if (args.primaryResource == PYSPARK_SHELL) {
    args.mainClass = "org.apache.spark.api.python.PythonGatewayServer"
  } else {
    // If a python file is provided, add it to the child arguments and list of files to deploy.
    // Usage: PythonAppRunner <main python file> <extra python files> [app arguments]
    args.mainClass = "org.apache.spark.deploy.PythonRunner"
    args.childArgs = ArrayBuffer(args.primaryResource, args.pyFiles) ++ args.childArgs
    if (clusterManager != YARN) {
      // The YARN backend distributes the primary file differently, so don't merge it.
      args.files = mergeFileLists(args.files, args.primaryResource)
    }
  }
}
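
As a rough illustration (a simplified sketch, not the actual PythonRunner source):
PythonRunner starts a Py4J gateway so the Python side can call back into the JVM,
then forks the Python interpreter on the primary .py file and waits for it to exit.

import py4j.GatewayServer
import scala.collection.JavaConverters._

object PythonRunnerSketch {
  def main(args: Array[String]): Unit = {
    val pythonFile = args(0)                      // primary resource passed by spark-submit
    val gateway = new GatewayServer(null, 0)      // port 0 lets Py4J pick a free port
    gateway.start()

    val builder = new ProcessBuilder(Seq("python", pythonFile).asJava)
    builder.environment().put("PYSPARK_GATEWAY_PORT", gateway.getListeningPort.toString)
    builder.inheritIO()                           // stream the Python program's stdout/stderr
    val exitCode = builder.start().waitFor()      // the user's Python program runs here
    gateway.shutdown()
    sys.exit(exitCode)
  }
}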


On Wed, Oct 14, 2015 at 12:46 PM, skaarthik oss 
wrote:

> See PythonRunner @
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala
>
> On Tue, Oct 13, 2015 at 7:50 PM, canan chen  wrote:
>
>> I look at the source code of spark, but didn't find where python program
>> is started in python.
>>
>> It seems spark-submit will call PythonGatewayServer, but where is python
>> program started ?
>>
>> Thanks
>>
>
>


Re: Spark DataFrame GroupBy into List

2015-10-13 Thread SLiZn Liu
Hi Michael,

Can you be more specific on `collect_set`? Is it a built-in function or, if
it is a UDF, how is it defined?

BR,
Todd Leo

On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust 
wrote:

> import org.apache.spark.sql.functions._
>
> df.groupBy("category")
>   .agg(callUDF("collect_set", df("id")).as("id_list"))
>
> On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu 
> wrote:
>
>> Hey Spark users,
>>
>> I'm trying to group by a dataframe, by appending occurrences into a list
>> instead of count.
>>
>> Let's say we have a dataframe as shown below:
>>
>> | category | id |
>> |  |:--:|
>> | A| 1  |
>> | A| 2  |
>> | B| 3  |
>> | B| 4  |
>> | C| 5  |
>>
>> ideally, after some magic group by (reverse explode?):
>>
>> | category | id_list  |
>> |  |  |
>> | A| 1,2  |
>> | B| 3,4  |
>> | C| 5|
>>
>> any tricks to achieve that? Scala Spark API is preferred. =D
>>
>> BR,
>> Todd Leo
>>
>>
>>
>>
>


Re: Building with SBT and Scala 2.11

2015-10-13 Thread Adrian Tanase
Do you mean hadoop-2.4 or 2.6? Not sure if this is the issue, but I'm also
compiling the 1.5.1 version with Scala 2.11 and Hadoop 2.6 and it works.
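
For reference, the invocation would look roughly like this (a sketch;
-Phadoop-2.6 is an assumption based on building against Hadoop 2.6, since
there is no hadoop-2.11 profile):

dev/change-version-to-2.11.sh
build/sbt -Pyarn -Phadoop-2.6 -Dscala-2.11 compile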

-adrian

Sent from my iPhone

On 14 Oct 2015, at 03:53, Jakob Odersky wrote:

I'm having trouble compiling Spark with SBT for Scala 2.11. The command I use 
is:

dev/change-version-to-2.11.sh
build/sbt -Pyarn -Phadoop-2.11 -Dscala-2.11

followed by

compile

in the sbt shell.

The error I get specifically is:

spark/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308: no 
valid targets for annotation on value conf - it is discarded unused. You may 
specify targets with meta-annotations, e.g. @(transient @param)
[error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
[error]

However I am also getting a large amount of deprecation warnings, making me 
wonder if I am supplying some incompatible/unsupported options to sbt. I am 
using Java 1.8 and the latest Spark master sources.
Does someone know if I am doing anything wrong or is the sbt build broken?

thanks for your help,
--Jakob



When does python program started in pyspark

2015-10-13 Thread canan chen
I looked at the source code of Spark, but didn't find where the Python program
is started.

It seems spark-submit will call PythonGatewayServer, but where is the Python
program started?

Thanks


Running in cluster mode causes native library linking to fail

2015-10-13 Thread Bernardo Vecchia Stein
Hello,

I am trying to run some Scala code in cluster mode using spark-submit. This
code uses addLibrary to link with a .so that exists on the machine, and
this library has a function that is called natively (there's a native
definition as needed in the code).

The problem I'm facing is: whenever I try to run this code in cluster mode,
spark fails with the following message when trying to execute the native
function:
java.lang.UnsatisfiedLinkError:
org.name.othername.ClassName.nativeMethod([B[B)[B

Apparently, the library is being found by spark, but the required function
isn't found.

When trying to run in client mode, however, this doesn't fail and
everything works as expected.

Does anybody have any idea of what might be the problem here? Is there any
bug that could be related to this when running in cluster mode?

I appreciate any help.
Thanks,
Bernardo


Re: an problem about zippartition

2015-10-13 Thread Saisai Shao
Maybe you could try "localCheckpoint" instead.

On Wednesday, 14 October 2015, 张仪yf1 wrote:

> Thank you for your reply. It helped a lot. But when the data became
> bigger, the action cost more, is there any optimizer
>
>
>
> From: Saisai Shao [mailto:sai.sai.s...@gmail.com]
> Sent: 13 October 2015 20:29
> To: 张仪yf1
> Cc: user@spark.apache.org
>
> Subject: Re: an problem about zippartition
>
>
>
> You have to call the checkpoint regularly on rdd0 to cut the dependency
> chain, otherwise you will meet such problem as you mentioned, even stack
> overflow finally. This is a classic problem for high iterative job, you
> could google it for the fix solution.
>
>
>
> On Tue, Oct 13, 2015 at 7:09 PM, 张仪yf1  > wrote:
>
>
>
>
>
>
>
> Hi there,
>
> I hit an issue when using zipPartitions. First I created an RDD from a
> seq, then created another one and zip-partitioned them into rdd3, then
> cached rdd3, then created a new RDD and zip-partitioned it with rdd3. I
> repeated this operation many times, and I found that the serialized task
> became bigger and bigger, and the serialization time grew too. Has anyone
> else encountered the same problem? Please help.
>
> Code:
>
> def main(args: Array[String]) {
>   var i: Int = 0
>   var seq = Seq[String]()
>   while (i < 720) {
>     {
>       seq = seq.+:("aa")
>     }
>     i = i + 1
>   }
>   val sc = new SparkContext("spark://hdh010016146205:7077", "ScalaTest")
>   var rdd0: RDD[String] = null
>   var j = 0;
>   while (j < 200) {
>     j = j + 1
>     val rdd1 = sc.parallelize(seq, 72)
>     if (rdd0 == null) {
>       rdd0 = rdd1
>       rdd0.cache()
>       rdd0.count()
>     } else {
>       val rdd2 = new BillZippedPartitionsRDD2[String, String, String](sc, { (thisIter, otherIter) =>
>         //val rdd2 = rdd0.zipPartitions(rdd1, true)({ (thisIter, otherIter) =>
>         new Iterator[String] {
>           def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
>             case (true, true) => true
>             case (false, false) => false
>             case _ => throw new SparkException("Can only zip RDDs with " +
>               "same number of elements in each partition")
>           }
>           def next(): String = (thisIter.next() + "--" + otherIter.next())
>         }
>       }, rdd0, rdd1, false)
>       rdd2.cache()
>       rdd2.count()
>       rdd0 = rdd2
>     }
>   }
> }
>
>
>
>
>


OutOfMemoryError When Reading Many json Files

2015-10-13 Thread SLiZn Liu
Hey Spark Users,

I kept getting java.lang.OutOfMemoryError: Java heap space as I read a
massive amount of json files, iteratively via read.json(). Even though the
resulting RDD is rather small, I still get the OOM error. The brief structure
of my program reads as follows, in pseudo-code:

file_path_list.map{ jsonFile: String =>
  sqlContext.read.json(jsonFile)
    .select($"some", $"fields")
    .withColumn("new_col", some_transformations($"col"))
    .rdd.map( x: Row => (k, v) )
    .combineByKey() // groups a column into item lists, keyed by another column
}.reduce( (i, j) => i.union(j) )
  .combineByKey() // combines results from all json files

I confess some of the json files are gigabytes huge, yet the combined RDD is
only a few megabytes. I’m not familiar with the under-the-hood mechanism, but
my intuitive understanding of how the code executes is: read one file at a
time (I can easily change map to foreach when fetching from file_path_list,
if that’s what it takes), do the inner transformation on the DataFrame and
combine, then reduce and do the outer combine immediately, which shouldn’t
require holding the RDDs generated from all files in memory at once. Obviously,
since my code raises the OOM error, I must have missed something important.

From the debug log, I can tell the OOM error happens when reading the same
file, which is a modest 2GB in size, while driver.memory is set to 13GB,
and the available memory before the code executes is around 8GB, on
my standalone machine running as “local[8]”.

To overcome this, I also tried to initialize an empty universal RDD
variable, iteratively read one file at a time using foreach, then instead
of reduce, simply combine each RDD generated by the json files, except the
OOM Error remains.

Other configurations:

   - set(“spark.storage.memoryFraction”, “0.1”) // no cache of RDD is used
   - set(“spark.serializer”, “org.apache.spark.serializer.KryoSerializer”)

Any suggestions other than scale up/out the spark cluster?

BR,
Todd Leo
​


Re: compatibility issue with Jersey2

2015-10-13 Thread Mingyu Kim
Hi all,

I filed https://issues.apache.org/jira/browse/SPARK-11081. Since Jersey’s
surface area is relatively small and it seems to be used only for the Spark UI
and the json API, shading the dependency might make sense, similar to what’s
done for the Jetty dependencies in
https://issues.apache.org/jira/browse/SPARK-3996. Would this be reasonable?

Mingyu







On 10/7/15, 11:26 AM, "Marcelo Vanzin"  wrote:

>Seems like you might be running into
>https://issues.apache.org/jira/browse/SPARK-10910 . I've been busy with
>other things but plan to take a look at that one when I find time...
>right now I don't really have a solution, other than making sure your
>application's jars do not include those classes the exception is
>complaining about.
>
>On Wed, Oct 7, 2015 at 10:23 AM, Gary Ogden  wrote:
>> What you suggested seems to have worked for unit tests. But now it throws
>> this at run time on mesos with spark-submit:
>>
>> Exception in thread "main" java.lang.LinkageError: loader constraint
>> violation: when resolving method
>> "org.slf4j.impl.StaticLoggerBinder.getLoggerFactory()Lorg/slf4j/ILoggerFactory;"
>> the class loader (instance of
>> org/apache/spark/util/ChildFirstURLClassLoader) of the current class,
>> org/slf4j/LoggerFactory, and the class loader (instance of
>> sun/misc/Launcher$AppClassLoader) for resolved class,
>> org/slf4j/impl/StaticLoggerBinder, have different Class objects for the type
>> LoggerFactory; used in the signature
>>  at org.slf4j.LoggerFactory.getILoggerFactory(LoggerFactory.java:336)
>>  at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:284)
>>  at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:305)
>>  at com.company.spark.utils.SparkJob.(SparkJob.java:41)
>>  at java.lang.Class.forName0(Native Method)
>>  at java.lang.Class.forName(Unknown Source)
>>  at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:634)
>>  at 
>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
>>  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
>>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
>>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>>
>> On 6 October 2015 at 16:20, Marcelo Vanzin  wrote:
>>>
>>> On Tue, Oct 6, 2015 at 12:04 PM, Gary Ogden  wrote:
>>> > But we run unit tests differently in our build environment, which is
>>> > throwing the error. It's setup like this:
>>> >
>>> > I suspect this is what you were referring to when you said I have a
>>> > problem?
>>>
>>> Yes, that is what I was referring to. But, in your test environment,
>>> you might be able to work around the problem by setting
>>> "spark.ui.enabled=false"; that should disable all the code that uses
>>> Jersey, so you can use your newer version in your unit tests.
>>>
>>>
>>> --
>>> Marcelo
>>
>>
>
>
>
>-- 
>Marcelo
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>



Re: Install via directions in "Learning Spark". Exception when running bin/pyspark

2015-10-13 Thread Robineast
What you have done should work.

A couple of things to try:

1) you should have a lib directory in your Spark deployment, it should have
a jar file called lib/spark-assembly-1.5.1-hadoop2.6.0.jar. Is it there?
2) Have you set the JAVA_HOME variable to point to your java8 deployment? If
not try doing that (a quick way to check both is sketched below).
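
For example, from a shell (the Java path below is a placeholder, not your actual
install location):

ls $SPARK_HOME/lib/spark-assembly-*.jar      # the assembly jar should show up here
export JAVA_HOME=/usr/lib/jvm/java-8-oracle  # point at your Java 8 installation
$JAVA_HOME/bin/java -version                 # should report a 1.8.x version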

Robin



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Install-via-directions-in-Learning-Spark-Exception-when-running-bin-pyspark-tp25043p25048.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: HiveThriftServer not registering with Zookeeper

2015-10-13 Thread Xiaoyu Wang

I have the same issue.
I think the Spark Thrift Server does not support HA with ZooKeeper yet.

On 2015-09-01 18:10, sreeramvenkat wrote:

Hi,

  I am trying to setup dynamic service discovery for HiveThriftServer in a
two node cluster.

In the thrift server logs, I am not seeing itself registering with zookeeper
- no znode is getting created.

Pasting relevant section from my $SPARK_HOME/conf/hive-site.xml


<property>
  <name>hive.zookeeper.quorum</name>
  <value>host1:port1,host2:port2</value>
</property>

<property>
  <name>hive.server2.support.dynamic.service.discovery</name>
  <value>true</value>
</property>

<property>
  <name>hive.server2.zookeeper.namespace</name>
  <value>hivethriftserver2</value>
</property>

<property>
  <name>hive.zookeeper.client.port</name>
  <value>2181</value>
</property>



  Any help is appreciated.

PS: Zookeeper is working fine and zknodes are getting created with
hiveserver2. This issue happens only with hivethriftserver.

Regards,
Sreeram



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/HiveThriftServer-not-registering-with-Zookeeper-tp24534.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




Re: unresolved dependency: org.apache.spark#spark-streaming_2.10;1.5.0: not found

2015-10-13 Thread Akhil Das
You need to add "org.apache.spark" % "spark-streaming_2.10" % "1.5.0" to
the dependencies list.
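
For example, as one more entry in the libraryDependencies sequence shown below
(artifact and version taken from the error message):

  "org.apache.spark" % "spark-streaming_2.10" % "1.5.0",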

Thanks
Best Regards

On Tue, Oct 6, 2015 at 3:20 PM, shahab  wrote:

> Hi,
>
> I am trying to use Spark 1.5 MLlib, but I keep getting
>  "sbt.ResolveException: unresolved dependency:
> org.apache.spark#spark-streaming_2.10;1.5.0: not found".
>
> It is weird that this happens, but I could not find any solution for it.
> Has anyone faced the same issue?
>
>
> best,
> /Shahab
>
> Here is my SBT library dependencies:
>
> libraryDependencies ++= Seq(
>
> "com.google.guava" % "guava" % "16.0"  ,
>
> "org.apache.spark" % "spark-unsafe_2.10" % "1.5.0",
>
> "org.apache.spark" % "spark-core_2.10" % "1.5.0",
>
> "org.apache.spark" % "spark-mllib_2.10" % "1.5.0",
>
> "org.apache.hadoop" % "hadoop-client" % "2.6.0",
>
> "net.java.dev.jets3t" % "jets3t" % "0.9.0" % "provided",
>
> "com.github.nscala-time" %% "nscala-time" % "1.0.0",
>
> "org.scalatest" % "scalatest_2.10" % "2.1.3",
>
> "junit" % "junit" % "4.8.1" % "test",
>
> "net.jpountz.lz4" % "lz4" % "1.2.0" % "provided",
>
> "org.clapper" %% "grizzled-slf4j" % "1.0.2",
>
> "net.jpountz.lz4" % "lz4" % "1.2.0" % "provided"
>
>)
>


Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
Hi Ted, if the fix went in after the 1.5.1 release, then how come it's working
with the 1.5.1 binary in spark-shell?
On Oct 13, 2015 1:32 PM, "Ted Yu"  wrote:

> Looks like the fix went in after 1.5.1 was released.
>
> You may verify using master branch build.
>
> Cheers
>
> On Oct 13, 2015, at 12:21 AM, Umesh Kacha  wrote:
>
> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like
> you mentioned it works using 1.5.1 but it doesn't compile in Java using
> 1.5.1 maven libraries it still complains same that callUdf can have string
> and column types only. Please guide.
> On Oct 13, 2015 12:34 AM, "Ted Yu"  wrote:
>
>> SQL context available as sqlContext.
>>
>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>> "value")
>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>
>> scala> df.select(callUDF("percentile_approx",col("value"),
>> lit(0.25))).show()
>> +--+
>> |'percentile_approx(value,0.25)|
>> +--+
>> |   1.0|
>> +--+
>>
>> Can you upgrade to 1.5.1 ?
>>
>> Cheers
>>
>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha 
>> wrote:
>>
>>> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is available
>>> in Spark 1.4.0 as per JAvadocx
>>>
>>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha 
>>> wrote:
>>>
 Hi Ted thanks much for the detailed answer and appreciate your efforts.
 Do we need to register Hive UDFs?

 sqlContext.udf.register("percentile_approx");???//is it valid?

 I am calling Hive UDF percentile_approx in the following manner which
 gives compilation error

 df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
 error

 //compile error because callUdf() takes String and Column* as arguments.

 Please guide. Thanks much.

 On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu  wrote:

> Using spark-shell, I did the following exercise (master branch) :
>
>
> SQL context available as sqlContext.
>
> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
> "value")
> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>
> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v *
> v + cnst)
> res0: org.apache.spark.sql.UserDefinedFunction =
> UserDefinedFunction(,IntegerType,List())
>
> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
> +---++
> | id|'simpleUDF(value,25)|
> +---++
> |id1|  26|
> |id2|  41|
> |id3|  50|
> +---++
>
> Which Spark release are you using ?
>
> Can you pastebin the full stack trace where you got the error ?
>
> Cheers
>
> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha 
> wrote:
>
>> I have a doubt Michael I tried to use callUDF in  the following code
>> it does not work.
>>
>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>
>> Above code does not compile because callUdf() takes only two
>> arguments function name in String and Column class type. Please guide.
>>
>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha 
>> wrote:
>>
>>> thanks much Michael let me try.
>>>
>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 This is confusing because I made a typo...

 callUDF("percentile_approx", col("mycol"), lit(0.25))

 The first argument is the name of the UDF, all other arguments need
 to be columns that are passed in as arguments.  lit is just saying to 
 make
 a literal column that always has the value 0.25.

 On Fri, Oct 9, 2015 at 12:16 PM, 
 wrote:

> Yes but I mean, this is rather curious. How is def
> lit(literal:Any) --> becomes a percentile function lit(25)
>
>
>
> Thanks for clarification
>
> Saif
>
>
>
> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
> *Sent:* Friday, October 09, 2015 4:10 PM
> *To:* Ellafi, Saif A.
> *Cc:* Michael Armbrust; user
>
> *Subject:* Re: How to calculate percentile of a column of
> DataFrame?
>
>
>
> I found it in 1.3 documentation lit says something else not percent
>
>
>
> public static Column 
> 

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
Hi Ted, thanks much. I tried using percentile_approx in spark-shell like you
mentioned and it works with 1.5.1, but it doesn't compile in Java using the
1.5.1 Maven libraries; it still complains that callUdf can only take string and
column types. Please guide.
On Oct 13, 2015 12:34 AM, "Ted Yu"  wrote:

> SQL context available as sqlContext.
>
> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>
> scala> df.select(callUDF("percentile_approx",col("value"),
> lit(0.25))).show()
> +--+
> |'percentile_approx(value,0.25)|
> +--+
> |   1.0|
> +--+
>
> Can you upgrade to 1.5.1 ?
>
> Cheers
>
> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha 
> wrote:
>
>> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is available
>> in Spark 1.4.0 as per JAvadocx
>>
>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha 
>> wrote:
>>
>>> Hi Ted thanks much for the detailed answer and appreciate your efforts.
>>> Do we need to register Hive UDFs?
>>>
>>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>>
>>> I am calling Hive UDF percentile_approx in the following manner which
>>> gives compilation error
>>>
>>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>>> error
>>>
>>> //compile error because callUdf() takes String and Column* as arguments.
>>>
>>> Please guide. Thanks much.
>>>
>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu  wrote:
>>>
 Using spark-shell, I did the following exercise (master branch) :


 SQL context available as sqlContext.

 scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
 "value")
 df: org.apache.spark.sql.DataFrame = [id: string, value: int]

 scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v *
 v + cnst)
 res0: org.apache.spark.sql.UserDefinedFunction =
 UserDefinedFunction(,IntegerType,List())

 scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
 +---++
 | id|'simpleUDF(value,25)|
 +---++
 |id1|  26|
 |id2|  41|
 |id3|  50|
 +---++

 Which Spark release are you using ?

 Can you pastebin the full stack trace where you got the error ?

 Cheers

 On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha 
 wrote:

> I have a doubt Michael I tried to use callUDF in  the following code
> it does not work.
>
> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>
> Above code does not compile because callUdf() takes only two arguments
> function name in String and Column class type. Please guide.
>
> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha 
> wrote:
>
>> thanks much Michael let me try.
>>
>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> This is confusing because I made a typo...
>>>
>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>
>>> The first argument is the name of the UDF, all other arguments need
>>> to be columns that are passed in as arguments.  lit is just saying to 
>>> make
>>> a literal column that always has the value 0.25.
>>>
>>> On Fri, Oct 9, 2015 at 12:16 PM, 
>>> wrote:
>>>
 Yes but I mean, this is rather curious. How is def lit(literal:Any)
 --> becomes a percentile function lit(25)



 Thanks for clarification

 Saif



 *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
 *Sent:* Friday, October 09, 2015 4:10 PM
 *To:* Ellafi, Saif A.
 *Cc:* Michael Armbrust; user

 *Subject:* Re: How to calculate percentile of a column of
 DataFrame?



 I found it in 1.3 documentation lit says something else not percent



 public static Column 
 
  lit(Object literal)

 Creates a Column
 
  of
 literal value.

 The passed in object is returned directly if it is already a Column
 .
 If the object is a Scala Symbol, it is converted into a Column
 

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
Pardon me.
I didn't read your previous response clearly.

I will try to reproduce the compilation error on master branch.
Right now, I have some other high priority task on hand.

BTW I was looking at SPARK-10671

FYI

On Tue, Oct 13, 2015 at 1:42 AM, Umesh Kacha  wrote:

> Hi Ted if fix went after 1.5.1 release then how come it's working with
> 1.5.1 binary in spark-shell.
> On Oct 13, 2015 1:32 PM, "Ted Yu"  wrote:
>
>> Looks like the fix went in after 1.5.1 was released.
>>
>> You may verify using master branch build.
>>
>> Cheers
>>
>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha  wrote:
>>
>> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like
>> you mentioned it works using 1.5.1 but it doesn't compile in Java using
>> 1.5.1 maven libraries it still complains same that callUdf can have string
>> and column types only. Please guide.
>> On Oct 13, 2015 12:34 AM, "Ted Yu"  wrote:
>>
>>> SQL context available as sqlContext.
>>>
>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>> "value")
>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>
>>> scala> df.select(callUDF("percentile_approx",col("value"),
>>> lit(0.25))).show()
>>> +--+
>>> |'percentile_approx(value,0.25)|
>>> +--+
>>> |   1.0|
>>> +--+
>>>
>>> Can you upgrade to 1.5.1 ?
>>>
>>> Cheers
>>>
>>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha 
>>> wrote:
>>>
 Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is
 available in Spark 1.4.0 as per JAvadocx

 On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha 
 wrote:

> Hi Ted thanks much for the detailed answer and appreciate your
> efforts. Do we need to register Hive UDFs?
>
> sqlContext.udf.register("percentile_approx");???//is it valid?
>
> I am calling Hive UDF percentile_approx in the following manner which
> gives compilation error
>
> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
> error
>
> //compile error because callUdf() takes String and Column* as
> arguments.
>
> Please guide. Thanks much.
>
> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu  wrote:
>
>> Using spark-shell, I did the following exercise (master branch) :
>>
>>
>> SQL context available as sqlContext.
>>
>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>> "value")
>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>
>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v
>> * v + cnst)
>> res0: org.apache.spark.sql.UserDefinedFunction =
>> UserDefinedFunction(,IntegerType,List())
>>
>> scala> df.select($"id", callUDF("simpleUDF", $"value",
>> lit(25))).show()
>> +---++
>> | id|'simpleUDF(value,25)|
>> +---++
>> |id1|  26|
>> |id2|  41|
>> |id3|  50|
>> +---++
>>
>> Which Spark release are you using ?
>>
>> Can you pastebin the full stack trace where you got the error ?
>>
>> Cheers
>>
>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha 
>> wrote:
>>
>>> I have a doubt Michael I tried to use callUDF in  the following code
>>> it does not work.
>>>
>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>
>>> Above code does not compile because callUdf() takes only two
>>> arguments function name in String and Column class type. Please guide.
>>>
>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha 
>>> wrote:
>>>
 thanks much Michael let me try.

 On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
 mich...@databricks.com> wrote:

> This is confusing because I made a typo...
>
> callUDF("percentile_approx", col("mycol"), lit(0.25))
>
> The first argument is the name of the UDF, all other arguments
> need to be columns that are passed in as arguments.  lit is just 
> saying to
> make a literal column that always has the value 0.25.
>
> On Fri, Oct 9, 2015 at 12:16 PM, 
> wrote:
>
>> Yes but I mean, this is rather curious. How is def
>> lit(literal:Any) --> becomes a percentile function lit(25)
>>
>>
>>
>> Thanks for clarification
>>
>> Saif
>>
>>
>>
>> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
>> *Sent:* 

writing to hive

2015-10-13 Thread Hafiz Mujadid
hi!

I am following this tutorial to read from and write to Hive, but I am facing
the following exception when I run the code.

15/10/12 14:57:36 INFO storage.BlockManagerMaster: Registered BlockManager
15/10/12 14:57:38 INFO scheduler.EventLoggingListener: Logging events to
hdfs://host:9000/spark/logs/local-1444676256555
Exception in thread "main" java.lang.VerifyError: Bad return type
Exception Details:
  Location:
   
org/apache/spark/sql/catalyst/expressions/Pmod.inputType()Lorg/apache/spark/sql/types/AbstractDataType;
@3: areturn
  Reason:
Type 'org/apache/spark/sql/types/NumericType$' (current frame, stack[0])
is not assignable to 'org/apache/spark/sql/types/AbstractDataType' (from
method signature)
  Current Frame:
bci: @3
flags: { }
locals: { 'org/apache/spark/sql/catalyst/expressions/Pmod' }
stack: { 'org/apache/spark/sql/types/NumericType$' }
  Bytecode:
000: b200 63b0

at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2595)
at java.lang.Class.getConstructor0(Class.java:2895)
at java.lang.Class.getDeclaredConstructor(Class.java:2066)
at
org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$4.apply(FunctionRegistry.scala:267)
at
org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$4.apply(FunctionRegistry.scala:267)
at scala.util.Try$.apply(Try.scala:161)
at
org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.expression(FunctionRegistry.scala:267)
at
org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.(FunctionRegistry.scala:148)
at
org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.(FunctionRegistry.scala)
at
org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:414)
at
org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:413)
at
org.apache.spark.sql.UDFRegistration.(UDFRegistration.scala:39)
at org.apache.spark.sql.SQLContext.(SQLContext.scala:203)
at
org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72)


Is there any suggestion on how to read from and write to Hive?

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/writing-to-hive-tp25046.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
Looks like the fix went in after 1.5.1 was released. 

You may verify using master branch build. 

Cheers

> On Oct 13, 2015, at 12:21 AM, Umesh Kacha  wrote:
> 
> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like you 
> mentioned it works using 1.5.1 but it doesn't compile in Java using 1.5.1 
> maven libraries it still complains same that callUdf can have string and 
> column types only. Please guide.
> 
>> On Oct 13, 2015 12:34 AM, "Ted Yu"  wrote:
>> SQL context available as sqlContext.
>> 
>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>> 
>> scala> df.select(callUDF("percentile_approx",col("value"), lit(0.25))).show()
>> +--+
>> |'percentile_approx(value,0.25)|
>> +--+
>> |   1.0|
>> +--+
>> 
>> Can you upgrade to 1.5.1 ?
>> 
>> Cheers
>> 
>>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha  wrote:
>>> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is available in 
>>> Spark 1.4.0 as per JAvadocx
>>> 
 On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha  
 wrote:
 Hi Ted thanks much for the detailed answer and appreciate your efforts. Do 
 we need to register Hive UDFs?
 
 sqlContext.udf.register("percentile_approx");???//is it valid?
 
 I am calling Hive UDF percentile_approx in the following manner which 
 gives compilation error
 
 df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
  error
 
 //compile error because callUdf() takes String and Column* as arguments.
 
 Please guide. Thanks much.
 
> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu  wrote:
> Using spark-shell, I did the following exercise (master branch) :
> 
> 
> SQL context available as sqlContext.
> 
> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", 
> "value")
> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
> 
> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v * v 
> + cnst)
> res0: org.apache.spark.sql.UserDefinedFunction = 
> UserDefinedFunction(,IntegerType,List())
> 
> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
> +---++
> | id|'simpleUDF(value,25)|
> +---++
> |id1|  26|
> |id2|  41|
> |id3|  50|
> +---++
> 
> Which Spark release are you using ?
> 
> Can you pastebin the full stack trace where you got the error ?
> 
> Cheers
> 
>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha  
>> wrote:
>> I have a doubt Michael I tried to use callUDF in  the following code it 
>> does not work. 
>> 
>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>> 
>> Above code does not compile because callUdf() takes only two arguments 
>> function name in String and Column class type. Please guide.
>> 
>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha  
>>> wrote:
>>> thanks much Michael let me try. 
>>> 
 On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust 
  wrote:
 This is confusing because I made a typo...
 
 callUDF("percentile_approx", col("mycol"), lit(0.25))
 
 The first argument is the name of the UDF, all other arguments need to 
 be columns that are passed in as arguments.  lit is just saying to 
 make a literal column that always has the value 0.25.
 
> On Fri, Oct 9, 2015 at 12:16 PM,  wrote:
> Yes but I mean, this is rather curious. How is def lit(literal:Any) 
> --> becomes a percentile function lit(25)
> 
>  
> 
> Thanks for clarification
> 
> Saif
> 
>  
> 
> From: Umesh Kacha [mailto:umesh.ka...@gmail.com] 
> Sent: Friday, October 09, 2015 4:10 PM
> To: Ellafi, Saif A.
> Cc: Michael Armbrust; user
> 
> 
> Subject: Re: How to calculate percentile of a column of DataFrame?
>  
> 
> I found it in 1.3 documentation lit says something else not percent
> 
>  
> 
> public static Column lit(Object literal)
> Creates a Column of literal value.
> 
> The passed in object is returned directly if it is already a Column. 
> If the object is a Scala Symbol, it is converted into a Column also. 
> Otherwise, 

Re: sql query orc slow

2015-10-13 Thread Patcharee Thongtra

Hi Zhan Zhang,

Could my problem (the ORC predicate is not generated from the WHERE clause
even though spark.sql.orc.filterPushdown=true) be related to some of the
factors below?


- orc file version (File Version: 0.12 with HIVE_8732)
- hive version (using Hive 1.2.1.2.3.0.0-2557)
- orc table is not sorted / indexed
- the split strategy hive.exec.orc.split.strategy

BR,
Patcharee


On 10/09/2015 08:01 PM, Zhan Zhang wrote:
That is weird. Unfortunately, there is no debug info available on this 
part. Can you please open a JIRA to add some debug information on the 
driver side?


Thanks.

Zhan Zhang

On Oct 9, 2015, at 10:22 AM, patcharee wrote:


I set hiveContext.setConf("spark.sql.orc.filterPushdown", "true"). 
But from the log No ORC pushdown predicate for my query with WHERE 
clause.


15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate

I did not understand what wrong with this.

BR,
Patcharee

On 09. okt. 2015 19:10, Zhan Zhang wrote:
In your case, you manually set an AND pushdown, and the predicate is 
right based on your setting, : leaf-0 = (EQUALS x 320)


The right way is to enable the predicate pushdown as follows.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true”)

Thanks.

Zhan Zhang







On Oct 9, 2015, at 9:58 AM, patcharee  wrote:


Hi Zhan Zhang

Actually my query has WHERE clause "select date, month, year, hh, 
(u*0.9122461 - v*-0.40964267), (v*0.9122461 + u*-0.40964267), z 
from 4D where x = 320 and y = 117 and zone == 2 and year=2009 and z 
>= 2 and z <= 8", column "x", "y" is not partition column, the 
others are partition columns. I expected the system will use 
predicate pushdown. I turned on the debug and found pushdown 
predicate was not generated ("DEBUG OrcInputFormat: No ORC pushdown 
predicate")


Then I tried to set the search argument explicitly (on the column 
"x" which is not partition column)


   val xs = 
SearchArgumentFactory.newBuilder().startAnd().equals("x", 
320).end().build()

   hiveContext.setConf("hive.io.file.readcolumn.names", "x")
   hiveContext.setConf("sarg.pushdown", xs.toKryo())

this time in the log pushdown predicate was generated but results 
was wrong (no results at all)


15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: 
leaf-0 = (EQUALS x 320)

expr = leaf-0

Any ideas What wrong with this? Why the ORC pushdown predicate is 
not applied by the system?


BR,
Patcharee

On 09. okt. 2015 18:31, Zhan Zhang wrote:

Hi Patcharee,

From the query, it looks like only the column pruning will be
applied. Partition pruning and predicate pushdown does not have 
effect. Do you see big IO difference between two methods?


The potential reason of the speed difference I can think of may be 
the different versions of OrcInputFormat. The hive path may use 
NewOrcInputFormat, but the spark path use OrcInputFormat.


Thanks.

Zhan Zhang

On Oct 8, 2015, at 11:55 PM, patcharee  
wrote:


Yes, the predicate pushdown is enabled, but still take longer 
time than the first method


BR,
Patcharee

On 08. okt. 2015 18:43, Zhan Zhang wrote:

Hi Patcharee,

Did you enable the predicate pushdown in the second method?

Thanks.

Zhan Zhang

On Oct 8, 2015, at 1:43 AM, patcharee 
 wrote:



Hi,

I am using spark sql 1.5 to query a hive table stored as 
partitioned orc file. We have the total files is about 6000 
files and each file size is about 245MB.


What is the difference between these two query methods below:

1. Using query on hive table directly

hiveContext.sql("select col1, col2 from table1")

2. Reading from orc file, register temp table and query from 
the temp table


val c = 
hiveContext.read.format("orc").load("/apps/hive/warehouse/table1")

c.registerTempTable("regTable")
hiveContext.sql("select col1, col2 from regTable")

When the number of files is large (query all from the total 
6000 files) , the second case is much slower then the first 
one. Any ideas why?


BR,




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 

For additional commands, e-mail: 
user-h...@spark.apache.org


















Re: Spark DataFrame GroupBy into List

2015-10-13 Thread Rishitesh Mishra
Hi Liu,
I could not see any operator on DataFrame which will give the desired
result. The DataFrame API, as expected, works on the Row format with a fixed
set of operators. However, you can achieve the desired result by accessing
the internal RDD as below:

val s = Seq(Test("A",1), Test("A",2),Test("B",1),Test("B",2))
val rdd = testSparkContext.parallelize(s)
val df = snc.createDataFrame(rdd)
val rdd1 = df.rdd.map(p => (Seq(p.getString(0)), Seq(p.getInt(1))))

// concatenate the value lists, so groups with more than two rows keep all ids
val reduceF = (p: Seq[Int], q: Seq[Int]) => p ++ q

val rdd3 = rdd1.reduceByKey(reduceF)
rdd3.foreach(r => println(r))



You can always reconvert the obtained RDD, after the transformation and
reduce, to a DataFrame, e.g.:
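
For example (a sketch; rdd3 and snc refer to the snippet above, and mkString is
just one way of rendering the id list):

import snc.implicits._
val grouped = rdd3.map { case (cat, ids) => (cat.head, ids.mkString(",")) }
  .toDF("category", "id_list")
grouped.show()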


Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://www.linkedin.com/profile/view?id=AAIAAAIFdkMB_v-nolCrFH6_pKf9oH6tZD8Qlgo=nav_responsive_tab_profile

On Tue, Oct 13, 2015 at 11:38 AM, SLiZn Liu  wrote:

> Hey Spark users,
>
> I'm trying to group by a dataframe, by appending occurrences into a list
> instead of count.
>
> Let's say we have a dataframe as shown below:
>
> | category | id |
> |  |:--:|
> | A| 1  |
> | A| 2  |
> | B| 3  |
> | B| 4  |
> | C| 5  |
>
> ideally, after some magic group by (reverse explode?):
>
> | category | id_list  |
> |  |  |
> | A| 1,2  |
> | B| 3,4  |
> | C| 5|
>
> any tricks to achieve that? Scala Spark API is preferred. =D
>
> BR,
> Todd Leo
>
>
>
>


--


an problem about zippartition

2015-10-13 Thread 张仪yf1



Hi there,
I hit an issue when using zipPartitions. First I created an RDD from a seq,
then created another one and zip-partitioned them into rdd3, then cached rdd3,
then created a new RDD and zip-partitioned it with rdd3. I repeated this
operation many times, and I found that the serialized task became bigger and
bigger, and the serialization time grew too. Has anyone else encountered the
same problem? Please help.
Code:
def main(args: Array[String]) {
  var i: Int = 0
  var seq = Seq[String]()
  while (i < 720) {
    {
      seq = seq.+:("aa")
    }
    i = i + 1
  }
  val sc = new SparkContext("spark://hdh010016146205:7077", "ScalaTest")
  var rdd0: RDD[String] = null
  var j = 0;
  while (j < 200) {
    j = j + 1
    val rdd1 = sc.parallelize(seq, 72)
    if (rdd0 == null) {
      rdd0 = rdd1
      rdd0.cache()
      rdd0.count()
    } else {
      val rdd2 = new BillZippedPartitionsRDD2[String, String, String](sc, { (thisIter, otherIter) =>
        //val rdd2 = rdd0.zipPartitions(rdd1, true)({ (thisIter, otherIter) =>
        new Iterator[String] {
          def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
            case (true, true) => true
            case (false, false) => false
            case _ => throw new SparkException("Can only zip RDDs with " +
              "same number of elements in each partition")
          }
          def next(): String = (thisIter.next() + "--" + otherIter.next())
        }
      }, rdd0, rdd1, false)
      rdd2.cache()
      rdd2.count()
      rdd0 = rdd2
    }
  }
}



How can I use dynamic resource allocation option in spark-jobserver?

2015-10-13 Thread JUNG YOUSUN
Hi all,
I have some questions about spark -jobserver.
I deployed a spark-jobserver in yarn-client mode using Docker.
I’d like to use the dynamic resource allocation option for YARN in spark-jobserver.
How can I add this option?
And when will the 1.5.x version be supported?
(https://hub.docker.com/r/velvia/spark-jobserver/tags/)
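
For reference, the standard Spark-on-YARN settings I mean look like this
(a sketch of spark-defaults.conf entries; whether and how the jobserver config
forwards them is exactly what I am asking):

spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true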

Thanks in advance and Best Regards,
Yousun Jeong



Re: Spark DataFrame GroupBy into List

2015-10-13 Thread SLiZn Liu
Hi Rishitesh,

I did it by CombineByKey, but your solution is more clear and readable, at
least doesn't require 3 lambda functions to get confused with. Will
definitely try it out tomorrow, thanks. 

Plus, OutOfMemoryError keeps bothering me as I read a massive amount of
json files, whereas the yielded RDD by CombineByKey is rather small. Anyway
I'll file another mail to describe this.

BR,
Todd Leo

On Tuesday, 13 October 2015 at 19:05, Rishitesh Mishra wrote:

> Hi Liu,
> I could not see any operator on DataFrame which will give the desired
> result . DataFrame APIs as expected works on Row format and a fixed set of
> operators on them.
> However you can achive the desired result by accessing the internal RDD as
> below..
>
> val s = Seq(Test("A",1), Test("A",2),Test("B",1),Test("B",2))
> val rdd = testSparkContext.parallelize(s)
> val df = snc.createDataFrame(rdd)
> val rdd1 = df.rdd.map(p => (Seq(p.getString(0)), Seq(p.getInt(1))))
>
> // concatenate the value lists, so groups with more than two rows keep all ids
> val reduceF = (p: Seq[Int], q: Seq[Int]) => p ++ q
>
> val rdd3 = rdd1.reduceByKey(reduceF)
> rdd3.foreach(r => println(r))
>
>
>
> You can always reconvert the obtained RDD after tranformation and reduce to a 
> DataFrame.
>
>
> Regards,
> Rishitesh Mishra,
> SnappyData . (http://www.snappydata.io/)
>
>
> https://www.linkedin.com/profile/view?id=AAIAAAIFdkMB_v-nolCrFH6_pKf9oH6tZD8Qlgo=nav_responsive_tab_profile
>
> On Tue, Oct 13, 2015 at 11:38 AM, SLiZn Liu 
> wrote:
>
>> Hey Spark users,
>>
>> I'm trying to group by a dataframe, by appending occurrences into a list
>> instead of count.
>>
>> Let's say we have a dataframe as shown below:
>>
>> | category | id |
>> |  |:--:|
>> | A| 1  |
>> | A| 2  |
>> | B| 3  |
>> | B| 4  |
>> | C| 5  |
>>
>> ideally, after some magic group by (reverse explode?):
>>
>> | category | id_list  |
>> |  |  |
>> | A| 1,2  |
>> | B| 3,4  |
>> | C| 5|
>>
>> any tricks to achieve that? Scala Spark API is preferred. =D
>>
>> BR,
>> Todd Leo
>>
>>
>>
>>
>
>
> --
>
>
>


Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-13 Thread Joshua Fox
I am accessing the Spark Jobs Web GUI, running on AWS EMR.

I can access this webapp (port 4040 as per default), but it only
half-renders, producing "Uncaught SyntaxError: Unexpected token <"

Here is a screenshot, including the Chrome Developer Console:

[image: Screenshot]

Here are some of the error messages in my Chrome console.





Uncaught SyntaxError: Unexpected token <
(index):3 Resource interpreted as Script but transferred with MIME type
text/html: "http://ec2-52-89-59-167.us-west-2.compute.amazonaws.com:4040/jobs/".
(index):74 Uncaught ReferenceError: drawApplicationTimeline is not defined
(index):12 Resource interpreted as Image but transferred with MIME type
text/html: "http://ec2-52-89-59-167.us-west-2.compute.amazonaws.com:4040/jobs/"
Note that the History GUI at port 18080 and the Hadoop GUI at port 8088
work fine, and the Spark jobs GUI does partly render. So, it seems that my
browser proxy is not the cause of this problem.

Joshua


Re: localhost webui port

2015-10-13 Thread Saisai Shao
By setting "spark.ui.port" to a port you can bind to.
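
For example (any open port works; 4050 here is arbitrary):

val conf = new SparkConf().setAppName("app").set("spark.ui.port", "4050")

or equivalently --conf spark.ui.port=4050 on spark-submit.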

On Tue, Oct 13, 2015 at 8:47 PM, Langston, Jim 
wrote:

> Hi all,
>
> Is there anyway to change the default port 4040 for the localhost webUI,
> unfortunately, that port is blocked and I have no control of that. I have
> not found any configuration parameter that would enable me to change it.
>
> Thanks,
>
> Jim
>


Re: an problem about zippartition

2015-10-13 Thread Saisai Shao
You have to call checkpoint regularly on rdd0 to cut the dependency chain,
otherwise you will hit the problem you mentioned, and eventually even a stack
overflow. This is a classic problem for highly iterative jobs; you can google
it for the fix.
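
A minimal self-contained sketch of the idea (the checkpoint directory and the
interval are placeholders, not taken from your code):

import org.apache.spark.{SparkConf, SparkContext}

object CheckpointRegularly {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // any reliable directory

    var acc = sc.parallelize(1 to 1000, 8)
    for (i <- 1 to 200) {
      val next = sc.parallelize(1 to 1000, 8)
      acc = acc.zipPartitions(next) { (a, b) => a.zip(b).map { case (x, y) => x + y } }
      acc.cache()
      if (i % 10 == 0) {   // every N iterations, truncate the accumulated lineage
        acc.checkpoint()
        acc.count()        // checkpointing happens on the next action
      }
    }
    acc.count()
    sc.stop()
  }
}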

On Tue, Oct 13, 2015 at 7:09 PM, 张仪yf1  wrote:

>
>
>
>
>
>
> Hi there,
>
> I hit an issue when using zipPartitions. First I created an RDD from a
> seq, then created another one and zip-partitioned them into rdd3, then
> cached rdd3, then created a new RDD and zip-partitioned it with rdd3. I
> repeated this operation many times, and I found that the serialized task
> became bigger and bigger, and the serialization time grew too. Has anyone
> else encountered the same problem? Please help.
>
> Code:
>
> def main(args: Array[String]) {
>   var i: Int = 0
>   var seq = Seq[String]()
>   while (i < 720) {
>     {
>       seq = seq.+:("aa")
>     }
>     i = i + 1
>   }
>   val sc = new SparkContext("spark://hdh010016146205:7077", "ScalaTest")
>   var rdd0: RDD[String] = null
>   var j = 0;
>   while (j < 200) {
>     j = j + 1
>     val rdd1 = sc.parallelize(seq, 72)
>     if (rdd0 == null) {
>       rdd0 = rdd1
>       rdd0.cache()
>       rdd0.count()
>     } else {
>       val rdd2 = new BillZippedPartitionsRDD2[String, String, String](sc, { (thisIter, otherIter) =>
>         //val rdd2 = rdd0.zipPartitions(rdd1, true)({ (thisIter, otherIter) =>
>         new Iterator[String] {
>           def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
>             case (true, true) => true
>             case (false, false) => false
>             case _ => throw new SparkException("Can only zip RDDs with " +
>               "same number of elements in each partition")
>           }
>           def next(): String = (thisIter.next() + "--" + otherIter.next())
>         }
>       }, rdd0, rdd1, false)
>       rdd2.cache()
>       rdd2.count()
>       rdd0 = rdd2
>     }
>   }
> }
>
>
>


localhost webui port

2015-10-13 Thread Langston, Jim
Hi all,

Is there anyway to change the default port 4040 for the localhost webUI,
unfortunately, that port is blocked and I have no control of that. I have
not found any configuration parameter that would enable me to change it.

Thanks,

Jim


Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
Can you pastebin your Java code and the command you used to compile ?

Thanks

> On Oct 13, 2015, at 1:42 AM, Umesh Kacha  wrote:
> 
> Hi Ted if fix went after 1.5.1 release then how come it's working with 1.5.1 
> binary in spark-shell.
> 
>> On Oct 13, 2015 1:32 PM, "Ted Yu"  wrote:
>> Looks like the fix went in after 1.5.1 was released. 
>> 
>> You may verify using master branch build. 
>> 
>> Cheers
>> 
>>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha  wrote:
>>> 
>>> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like you 
>>> mentioned it works using 1.5.1 but it doesn't compile in Java using 1.5.1 
>>> maven libraries it still complains same that callUdf can have string and 
>>> column types only. Please guide.
>>> 
 On Oct 13, 2015 12:34 AM, "Ted Yu"  wrote:
 SQL context available as sqlContext.
 
 scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
 df: org.apache.spark.sql.DataFrame = [id: string, value: int]
 
 scala> df.select(callUDF("percentile_approx",col("value"), 
 lit(0.25))).show()
 +--+
 |'percentile_approx(value,0.25)|
 +--+
 |   1.0|
 +--+
 
 Can you upgrade to 1.5.1 ?
 
 Cheers
 
> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha  
> wrote:
> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is available 
> in Spark 1.4.0 as per JAvadocx
> 
>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha  
>> wrote:
>> Hi Ted thanks much for the detailed answer and appreciate your efforts. 
>> Do we need to register Hive UDFs?
>> 
>> sqlContext.udf.register("percentile_approx");???//is it valid?
>> 
>> I am calling Hive UDF percentile_approx in the following manner which 
>> gives compilation error
>> 
>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>>  error
>> 
>> //compile error because callUdf() takes String and Column* as arguments.
>> 
>> Please guide. Thanks much.
>> 
>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu  wrote:
>>> Using spark-shell, I did the following exercise (master branch) :
>>> 
>>> 
>>> SQL context available as sqlContext.
>>> 
>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", 
>>> "value")
>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>> 
>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v * 
>>> v + cnst)
>>> res0: org.apache.spark.sql.UserDefinedFunction = 
>>> UserDefinedFunction(,IntegerType,List())
>>> 
>>> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
>>> +---++
>>> | id|'simpleUDF(value,25)|
>>> +---++
>>> |id1|  26|
>>> |id2|  41|
>>> |id3|  50|
>>> +---++
>>> 
>>> Which Spark release are you using ?
>>> 
>>> Can you pastebin the full stack trace where you got the error ?
>>> 
>>> Cheers
>>> 
 On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha  
 wrote:
 I have a doubt Michael I tried to use callUDF in  the following code 
 it does not work. 
 
 sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
 
 Above code does not compile because callUdf() takes only two arguments 
 function name in String and Column class type. Please guide.
 
> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha  
> wrote:
> thanks much Michael let me try. 
> 
>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust 
>>  wrote:
>> This is confusing because I made a typo...
>> 
>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>> 
>> The first argument is the name of the UDF, all other arguments need 
>> to be columns that are passed in as arguments.  lit is just saying 
>> to make a literal column that always has the value 0.25.
>> 
>>> On Fri, Oct 9, 2015 at 12:16 PM,  
>>> wrote:
>>> Yes but I mean, this is rather curious. How is def lit(literal:Any) 
>>> --> becomes a percentile function lit(25)
>>> 
>>>  
>>> 
>>> Thanks for clarification
>>> 
>>> Saif
>>> 
>>>  
>>> 
>>> From: Umesh Kacha [mailto:umesh.ka...@gmail.com] 
>>> Sent: Friday, October 09, 

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
OK, thanks much Ted. It looks like there is some issue with using the 1.5.1
Maven dependencies from Java code. I still can't understand why, if the Spark
1.5.1 binary in spark-shell recognizes callUdf, the same callUdf does not
compile when using the Maven build.
On Oct 13, 2015 2:20 PM, "Ted Yu"  wrote:

> Pardon me.
> I didn't read your previous response clearly.
>
> I will try to reproduce the compilation error on master branch.
> Right now, I have some other high priority task on hand.
>
> BTW I was looking at SPARK-10671
>
> FYI
>
> On Tue, Oct 13, 2015 at 1:42 AM, Umesh Kacha 
> wrote:
>
>> Hi Ted if fix went after 1.5.1 release then how come it's working with
>> 1.5.1 binary in spark-shell.
>> On Oct 13, 2015 1:32 PM, "Ted Yu"  wrote:
>>
>>> Looks like the fix went in after 1.5.1 was released.
>>>
>>> You may verify using master branch build.
>>>
>>> Cheers
>>>
>>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha  wrote:
>>>
>>> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like
>>> you mentioned it works using 1.5.1 but it doesn't compile in Java using
>>> 1.5.1 maven libraries it still complains same that callUdf can have string
>>> and column types only. Please guide.
>>> On Oct 13, 2015 12:34 AM, "Ted Yu"  wrote:
>>>
 SQL context available as sqlContext.

 scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
 "value")
 df: org.apache.spark.sql.DataFrame = [id: string, value: int]

 scala> df.select(callUDF("percentile_approx",col("value"),
 lit(0.25))).show()
 +--+
 |'percentile_approx(value,0.25)|
 +--+
 |   1.0|
 +--+

 Can you upgrade to 1.5.1 ?

 Cheers

 On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha 
 wrote:

> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is
> available in Spark 1.4.0 as per JAvadocx
>
> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha 
> wrote:
>
>> Hi Ted thanks much for the detailed answer and appreciate your
>> efforts. Do we need to register Hive UDFs?
>>
>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>
>> I am calling Hive UDF percentile_approx in the following manner which
>> gives compilation error
>>
>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>> error
>>
>> //compile error because callUdf() takes String and Column* as
>> arguments.
>>
>> Please guide. Thanks much.
>>
>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu  wrote:
>>
>>> Using spark-shell, I did the following exercise (master branch) :
>>>
>>>
>>> SQL context available as sqlContext.
>>>
>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>> "value")
>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>
>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v
>>> * v + cnst)
>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>> UserDefinedFunction(,IntegerType,List())
>>>
>>> scala> df.select($"id", callUDF("simpleUDF", $"value",
>>> lit(25))).show()
>>> +---++
>>> | id|'simpleUDF(value,25)|
>>> +---++
>>> |id1|  26|
>>> |id2|  41|
>>> |id3|  50|
>>> +---++
>>>
>>> Which Spark release are you using ?
>>>
>>> Can you pastebin the full stack trace where you got the error ?
>>>
>>> Cheers
>>>
>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha 
>>> wrote:
>>>
 I have a doubt Michael I tried to use callUDF in  the following
 code it does not work.

 sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))

 Above code does not compile because callUdf() takes only two
 arguments function name in String and Column class type. Please guide.

 On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha  wrote:

> thanks much Michael let me try.
>
> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
> mich...@databricks.com> wrote:
>
>> This is confusing because I made a typo...
>>
>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>
>> The first argument is the name of the UDF, all other arguments
>> need to be columns that are passed in as arguments.  lit is just 
>> saying to
>> make a literal column that always has the value 

Re: Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-13 Thread Jean-Baptiste Onofré

Hi Joshua,

What's the Spark version and what's your browser ?

I just tried on Spark 1.6-SNAPSHOT with firefox and it works fine.

Thanks
Regards
JB

On 10/13/2015 02:17 PM, Joshua Fox wrote:

I am accessing the Spark Jobs Web GUI, running on AWS EMR.

I can access this webapp (port 4040 as per default), but it only
half-renders, producing "Uncaught SyntaxError: Unexpected token <"

Here is a screenshot  including Chrome
Developer Console.

Screenshot 

Here are some of the error messages in my Chrome console.

Uncaught SyntaxError: Unexpected token <
(index):3 Resource interpreted as Script but transferred with MIME type
text/html:
"http://ec2-52-89-59-167.us-west-2.compute.amazonaws.com:4040/jobs/".
(index):74 Uncaught ReferenceError: drawApplicationTimeline is not defined
(index):12 Resource interpreted as Image but transferred with MIME type
text/html:
"http://ec2-52-89-59-167.us-west-2.compute.amazonaws.com:4040/jobs/"

Note that the History GUI at port 18080 and the Hadoop GUI at port 8088
work fine, and the Spark jobs GUI does partly render. So, it seems that
my browser proxy is not the cause of this problem.

Joshua


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
Hi Ted, I am using the following line of code. I can't paste the entire code,
sorry, but this single line does not compile in my Spark job:
 sourceframe.select(callUDF("percentile_approx",col("mycol"), lit(0.25)))

I am using the IntelliJ editor with Java, and Maven dependencies for
spark-core, spark-sql and spark-hive, version 1.5.1.
On Oct 13, 2015 18:21, "Ted Yu"  wrote:

> Can you pastebin your Java code and the command you used to compile ?
>
> Thanks
>
> On Oct 13, 2015, at 1:42 AM, Umesh Kacha  wrote:
>
> Hi Ted if fix went after 1.5.1 release then how come it's working with
> 1.5.1 binary in spark-shell.
> On Oct 13, 2015 1:32 PM, "Ted Yu"  wrote:
>
>> Looks like the fix went in after 1.5.1 was released.
>>
>> You may verify using master branch build.
>>
>> Cheers
>>
>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha  wrote:
>>
>> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like
>> you mentioned it works using 1.5.1 but it doesn't compile in Java using
>> 1.5.1 maven libraries it still complains same that callUdf can have string
>> and column types only. Please guide.
>> On Oct 13, 2015 12:34 AM, "Ted Yu"  wrote:
>>
>>> SQL context available as sqlContext.
>>>
>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>> "value")
>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>
>>> scala> df.select(callUDF("percentile_approx",col("value"),
>>> lit(0.25))).show()
>>> +--+
>>> |'percentile_approx(value,0.25)|
>>> +--+
>>> |   1.0|
>>> +--+
>>>
>>> Can you upgrade to 1.5.1 ?
>>>
>>> Cheers
>>>
>>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha 
>>> wrote:
>>>
 Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is
 available in Spark 1.4.0 as per JAvadocx

 On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha 
 wrote:

> Hi Ted thanks much for the detailed answer and appreciate your
> efforts. Do we need to register Hive UDFs?
>
> sqlContext.udf.register("percentile_approx");???//is it valid?
>
> I am calling Hive UDF percentile_approx in the following manner which
> gives compilation error
>
> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
> error
>
> //compile error because callUdf() takes String and Column* as
> arguments.
>
> Please guide. Thanks much.
>
> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu  wrote:
>
>> Using spark-shell, I did the following exercise (master branch) :
>>
>>
>> SQL context available as sqlContext.
>>
>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>> "value")
>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>
>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v
>> * v + cnst)
>> res0: org.apache.spark.sql.UserDefinedFunction =
>> UserDefinedFunction(,IntegerType,List())
>>
>> scala> df.select($"id", callUDF("simpleUDF", $"value",
>> lit(25))).show()
>> +---++
>> | id|'simpleUDF(value,25)|
>> +---++
>> |id1|  26|
>> |id2|  41|
>> |id3|  50|
>> +---++
>>
>> Which Spark release are you using ?
>>
>> Can you pastebin the full stack trace where you got the error ?
>>
>> Cheers
>>
>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha 
>> wrote:
>>
>>> I have a doubt Michael I tried to use callUDF in  the following code
>>> it does not work.
>>>
>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>
>>> Above code does not compile because callUdf() takes only two
>>> arguments function name in String and Column class type. Please guide.
>>>
>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha 
>>> wrote:
>>>
 thanks much Michael let me try.

 On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
 mich...@databricks.com> wrote:

> This is confusing because I made a typo...
>
> callUDF("percentile_approx", col("mycol"), lit(0.25))
>
> The first argument is the name of the UDF, all other arguments
> need to be columns that are passed in as arguments.  lit is just 
> saying to
> make a literal column that always has the value 0.25.
>
> On Fri, Oct 9, 2015 at 12:16 PM, 
> wrote:
>
>> Yes but I mean, this is rather curious. How is def
>> lit(literal:Any) --> 

Spark shuffle service does not work in stand alone

2015-10-13 Thread Saif.A.Ellafi
Has anyone tried the shuffle service in standalone cluster mode? I want to
enable it for dynamic allocation, but my jobs never start when I submit them.
This happens with all my jobs.



15/10/13 08:29:45 INFO DAGScheduler: Job 0 failed: json at DataLoader.scala:86, 
took 16.318615 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost 
task 0.3 in stage 0.0 (TID 7, 162.101.194.47): ExecutorLostFailure (executor 4 
lost)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1942)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1003)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:985)
at 
org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1114)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1091)
at 
org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
at 
org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:31)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:120)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:219)
at org.apache.saif.loaders.DataLoader$.load_json(DataLoader.scala:86)




Re: Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-13 Thread Jonathan Kelly
Joshua,

Since Spark is configured to run on YARN in EMR, instead of viewing the
Spark application UI at port 4040, you should instead start from the YARN
ResourceManager (on port 8088), then click on the ApplicationMaster link
for the Spark application you are interested in. This will take you to the
YARN ProxyServer on port 20888, which will proxy you through to the Spark
UI for the application (which renders correctly when viewed this way). This
works even if the Spark UI is running on a port other than 4040 and even in
yarn-cluster mode when the Spark driver is running on a slave node.

Hope this helps,
Jonathan

On Tue, Oct 13, 2015 at 7:19 AM, Jean-Baptiste Onofré 
wrote:

> Thanks for the update Joshua.
>
> Let me try with Spark 1.4.1.
>
> I keep you posted.
>
> Regards
> JB
>
> On 10/13/2015 04:17 PM, Joshua Fox wrote:
>
>>   * Spark 1.4.1, part of EMR emr-4.0.0
>>   * Chrome Version 41.0.2272.118 (64-bit) on Ubuntu
>>
>>
>> On Tue, Oct 13, 2015 at 3:27 PM, Jean-Baptiste Onofré > > wrote:
>>
>> Hi Joshua,
>>
>> What's the Spark version and what's your browser ?
>>
>> I just tried on Spark 1.6-SNAPSHOT with firefox and it works fine.
>>
>> Thanks
>> Regards
>> JB
>>
>> On 10/13/2015 02:17 PM, Joshua Fox wrote:
>>
>> I am accessing the Spark Jobs Web GUI, running on AWS EMR.
>>
>> I can access this webapp (port 4040 as per default), but it only
>> half-renders, producing "Uncaught SyntaxError: Unexpected token <"
>>
>> Here is a screenshot  including
>> Chrome
>> Developer Console.
>>
>> Screenshot 
>>
>> Here are some of the error messages in my Chrome console.
>>
>> /Uncaught SyntaxError: Unexpected token <
>> (index):3 Resource interpreted as Script but transferred with
>> MIME type
>> text/html:
>> "
>> http://ec2-52-89-59-167.us-west-2.compute.amazonaws.com:4040/jobs/;.
>> (index):74 Uncaught ReferenceError: drawApplicationTimeline is
>> not defined
>> (index):12 Resource interpreted as Image but transferred with
>> MIME type
>> text/html:
>> "
>> http://ec2-52-89-59-167.us-west-2.compute.amazonaws.com:4040/jobs/;
>>
>> /
>> Note that the History GUI at port 18080 and the Hadoop GUI at
>> port 8088
>> work fine, and the Spark jobs GUI does partly render. So, it
>> seems that
>> my browser proxy is not the cause of this problem.
>>
>> Joshua
>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org 
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> 
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-13 Thread Jean-Baptiste Onofré

Thanks for the update Joshua.

Let me try with Spark 1.4.1.

I keep you posted.

Regards
JB

On 10/13/2015 04:17 PM, Joshua Fox wrote:

  * Spark 1.4.1, part of EMR emr-4.0.0
  * Chrome Version 41.0.2272.118 (64-bit) on Ubuntu


On Tue, Oct 13, 2015 at 3:27 PM, Jean-Baptiste Onofré > wrote:

Hi Joshua,

What's the Spark version and what's your browser ?

I just tried on Spark 1.6-SNAPSHOT with firefox and it works fine.

Thanks
Regards
JB

On 10/13/2015 02:17 PM, Joshua Fox wrote:

I am accessing the Spark Jobs Web GUI, running on AWS EMR.

I can access this webapp (port 4040 as per default), but it only
half-renders, producing "Uncaught SyntaxError: Unexpected token <"

Here is a screenshot  including
Chrome
Developer Console.

Screenshot 

Here are some of the error messages in my Chrome console.

/Uncaught SyntaxError: Unexpected token <
(index):3 Resource interpreted as Script but transferred with
MIME type
text/html:
"http://ec2-52-89-59-167.us-west-2.compute.amazonaws.com:4040/jobs/;.
(index):74 Uncaught ReferenceError: drawApplicationTimeline is
not defined
(index):12 Resource interpreted as Image but transferred with
MIME type
text/html:
"http://ec2-52-89-59-167.us-west-2.compute.amazonaws.com:4040/jobs/;

/
Note that the History GUI at port 18080 and the Hadoop GUI at
port 8088
work fine, and the Spark jobs GUI does partly render. So, it
seems that
my browser proxy is not the cause of this problem.

Joshua


--
Jean-Baptiste Onofré
jbono...@apache.org 
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-13 Thread Joshua Fox
   - Spark 1.4.1, part of EMR emr-4.0.0
   - Chrome Version 41.0.2272.118 (64-bit) on Ubuntu


On Tue, Oct 13, 2015 at 3:27 PM, Jean-Baptiste Onofré 
wrote:

> Hi Joshua,
>
> What's the Spark version and what's your browser ?
>
> I just tried on Spark 1.6-SNAPSHOT with firefox and it works fine.
>
> Thanks
> Regards
> JB
>
> On 10/13/2015 02:17 PM, Joshua Fox wrote:
>
>> I am accessing the Spark Jobs Web GUI, running on AWS EMR.
>>
>> I can access this webapp (port 4040 as per default), but it only
>> half-renders, producing "Uncaught SyntaxError: Unexpected token <"
>>
>> Here is a screenshot  including Chrome
>> Developer Console.
>>
>> Screenshot 
>>
>> Here are some of the error messages in my Chrome console.
>>
>> /Uncaught SyntaxError: Unexpected token <
>> (index):3 Resource interpreted as Script but transferred with MIME type
>> text/html:
>> "http://ec2-52-89-59-167.us-west-2.compute.amazonaws.com:4040/jobs/;.
>> (index):74 Uncaught ReferenceError: drawApplicationTimeline is not defined
>> (index):12 Resource interpreted as Image but transferred with MIME type
>> text/html:
>> "http://ec2-52-89-59-167.us-west-2.compute.amazonaws.com:4040/jobs/;
>>
>> /
>> Note that the History GUI at port 18080 and the Hadoop GUI at port 8088
>> work fine, and the Spark jobs GUI does partly render. So, it seems that
>> my browser proxy is not the cause of this problem.
>>
>> Joshua
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark shuffle service does not work in stand alone

2015-10-13 Thread Jean-Baptiste Onofré

Hi,

AFAIK, the shuffle service makes sense only to delegate the shuffle to 
mapreduce (as mapreduce shuffle is most of the time faster than the 
spark shuffle).

As you run in standalone mode, shuffle service will use the spark shuffle.

Not 100% sure, though.

Regards
JB

On 10/13/2015 04:23 PM, saif.a.ell...@wellsfargo.com wrote:

Has anyone tried shuffle service in Stand Alone cluster mode? I want to
enable it for d.a. but my jobs never start when I submit them.
This happens with all my jobs.
15/10/13 08:29:45 INFO DAGScheduler: Job 0 failed: json at
DataLoader.scala:86, took 16.318615 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted
due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent
failure: Lost task 0.3 in stage 0.0 (TID 7, 162.101.194.47):
ExecutorLostFailure (executor 4 lost)
Driver stacktrace:
 at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
 at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
 at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
 at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
 at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
 at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
 at scala.Option.foreach(Option.scala:236)
 at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
 at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
 at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
 at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1942)
 at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1003)
 at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
 at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
 at org.apache.spark.rdd.RDD.reduce(RDD.scala:985)
 at
org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1114)
 at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
 at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
 at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1091)
 at
org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
 at
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
 at
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
 at scala.Option.getOrElse(Option.scala:120)
 at
org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
 at
org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
 at
org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
 at
org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
 at
org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:31)
 at
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:120)
 at
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
 at
org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:219)
 at
org.apache.saif.loaders.DataLoader$.load_json(DataLoader.scala:86)


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Why is my spark executor terminated?

2015-10-13 Thread Wang, Ningjun (LNG-NPV)
We use Spark on Windows 2008 R2 servers. We use one Spark context, which creates
one Spark executor. We run the Spark master, slave, driver and executor on a
single machine.

From time to time, we found that the executor Java process was terminated. I
cannot figure out why. Can anybody help me find out why the executor was
terminated?

The Spark slave log shows that it killed the executor process:
2015-10-13 09:58:06,087 INFO  [sparkWorker-akka.actor.default-dispatcher-16] 
worker.Worker (Logging.scala:logInfo(59)) - Asked to kill executor 
app-20151009201453-/0

But why does it do that?

Here are the detailed logs from the Spark slave:

2015-10-13 09:58:04,915 WARN  [sparkWorker-akka.actor.default-dispatcher-16] 
remote.ReliableDeliverySupervisor (Slf4jLogger.scala:apply$mcV$sp(71)) - 
Association with remote system 
[akka.tcp://sparkexecu...@qa1-cas01.pcc.lexisnexis.com:61234] has failed, 
address is now gated for [5000] ms. Reason is: [Disassociated].
2015-10-13 09:58:05,134 INFO  [sparkWorker-akka.actor.default-dispatcher-16] 
actor.LocalActorRef (Slf4jLogger.scala:apply$mcV$sp(74)) - Message 
[akka.remote.EndpointWriter$AckIdleCheckTimer$] from 
Actor[akka://sparkWorker/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FsparkExecutor%40QA1-CAS01.pcc.lexisnexis.com%3A61234-2/endpointWriter#-175670388]
 to 
Actor[akka://sparkWorker/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FsparkExecutor%40QA1-CAS01.pcc.lexisnexis.com%3A61234-2/endpointWriter#-175670388]
 was not delivered. [2] dead letters encountered. This logging can be turned 
off or adjusted with configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
2015-10-13 09:58:05,134 INFO  [sparkWorker-akka.actor.default-dispatcher-16] 
actor.LocalActorRef (Slf4jLogger.scala:apply$mcV$sp(74)) - Message 
[akka.remote.transport.AssociationHandle$Disassociated] from 
Actor[akka://sparkWorker/deadLetters] to 
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.196.116.184%3A61236-3#-1210125680]
 was not delivered. [3] dead letters encountered. This logging can be turned 
off or adjusted with configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
2015-10-13 09:58:05,134 INFO  [sparkWorker-akka.actor.default-dispatcher-16] 
actor.LocalActorRef (Slf4jLogger.scala:apply$mcV$sp(74)) - Message 
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
Actor[akka://sparkWorker/deadLetters] to 
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.196.116.184%3A61236-3#-1210125680]
 was not delivered. [4] dead letters encountered. This logging can be turned 
off or adjusted with configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
2015-10-13 09:58:06,087 INFO  [sparkWorker-akka.actor.default-dispatcher-16] 
worker.Worker (Logging.scala:logInfo(59)) - Asked to kill executor 
app-20151009201453-/0
2015-10-13 09:58:06,103 INFO  [ExecutorRunner for app-20151009201453-/0] 
worker.ExecutorRunner (Logging.scala:logInfo(59)) - Runner thread for executor 
app-20151009201453-/0 interrupted
2015-10-13 09:58:06,118 INFO  [ExecutorRunner for app-20151009201453-/0] 
worker.ExecutorRunner (Logging.scala:logInfo(59)) - Killing process!
2015-10-13 09:58:06,509 INFO  [sparkWorker-akka.actor.default-dispatcher-16] 
worker.Worker (Logging.scala:logInfo(59)) - Executor app-20151009201453-/0 
finished with state KILLED exitStatus 1
2015-10-13 09:58:06,509 INFO  [sparkWorker-akka.actor.default-dispatcher-16] 
worker.Worker (Logging.scala:logInfo(59)) - Cleaning up local directories for 
application app-20151009201453-

Thanks
Ningjun Wang



Re: Why is my spark executor terminated?

2015-10-13 Thread Jean-Baptiste Onofré

Hi Ningjun,

Nothing special in the master log ?

Regards
JB

On 10/13/2015 04:34 PM, Wang, Ningjun (LNG-NPV) wrote:

We use spark on windows 2008 R2 servers. We use one spark context which
create one spark executor. We run spark master, slave, driver, executor
on one single machine.

 From time to time, we found that the executor JAVA process was
terminated. I cannot fig out why it was terminated. Can anybody help me
on how to find out why the executor was terminated?

The spark slave log. It shows that it kill the executor process

2015-10-13 09:58:06,087 INFO
[sparkWorker-akka.actor.default-dispatcher-16] worker.Worker
(Logging.scala:logInfo(59)) - Asked to kill executor
app-20151009201453-/0

But why does it do that?

Here is the detailed logs from spark slave

2015-10-13 09:58:04,915 WARN
[sparkWorker-akka.actor.default-dispatcher-16]
remote.ReliableDeliverySupervisor (Slf4jLogger.scala:apply$mcV$sp(71)) -
Association with remote system
[akka.tcp://sparkexecu...@qa1-cas01.pcc.lexisnexis.com:61234] has
failed, address is now gated for [5000] ms. Reason is: [Disassociated].

2015-10-13 09:58:05,134 INFO
[sparkWorker-akka.actor.default-dispatcher-16] actor.LocalActorRef
(Slf4jLogger.scala:apply$mcV$sp(74)) - Message
[akka.remote.EndpointWriter$AckIdleCheckTimer$] from
Actor[akka://sparkWorker/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FsparkExecutor%40QA1-CAS01.pcc.lexisnexis.com%3A61234-2/endpointWriter#-175670388]
to
Actor[akka://sparkWorker/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FsparkExecutor%40QA1-CAS01.pcc.lexisnexis.com%3A61234-2/endpointWriter#-175670388]
was not delivered. [2] dead letters encountered. This logging can be
turned off or adjusted with configuration settings
'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

2015-10-13 09:58:05,134 INFO
[sparkWorker-akka.actor.default-dispatcher-16] actor.LocalActorRef
(Slf4jLogger.scala:apply$mcV$sp(74)) - Message
[akka.remote.transport.AssociationHandle$Disassociated] from
Actor[akka://sparkWorker/deadLetters] to
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.196.116.184%3A61236-3#-1210125680]
was not delivered. [3] dead letters encountered. This logging can be
turned off or adjusted with configuration settings
'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

2015-10-13 09:58:05,134 INFO
[sparkWorker-akka.actor.default-dispatcher-16] actor.LocalActorRef
(Slf4jLogger.scala:apply$mcV$sp(74)) - Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
from Actor[akka://sparkWorker/deadLetters] to
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.196.116.184%3A61236-3#-1210125680]
was not delivered. [4] dead letters encountered. This logging can be
turned off or adjusted with configuration settings
'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

2015-10-13 09:58:06,087 INFO
[sparkWorker-akka.actor.default-dispatcher-16] worker.Worker
(Logging.scala:logInfo(59)) - Asked to kill executor
app-20151009201453-/0

2015-10-13 09:58:06,103 INFO  [ExecutorRunner for
app-20151009201453-/0] worker.ExecutorRunner
(Logging.scala:logInfo(59)) - Runner thread for executor
app-20151009201453-/0 interrupted

2015-10-13 09:58:06,118 INFO  [ExecutorRunner for
app-20151009201453-/0] worker.ExecutorRunner
(Logging.scala:logInfo(59)) - Killing process!

2015-10-13 09:58:06,509 INFO
[sparkWorker-akka.actor.default-dispatcher-16] worker.Worker
(Logging.scala:logInfo(59)) - Executor app-20151009201453-/0
finished with state KILLED exitStatus 1

2015-10-13 09:58:06,509 INFO
[sparkWorker-akka.actor.default-dispatcher-16] worker.Worker
(Logging.scala:logInfo(59)) - Cleaning up local directories for
application app-20151009201453-

Thanks

Ningjun Wang



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Install via directions in "Learning Spark". Exception when running bin/pyspark

2015-10-13 Thread David Bess
Got it working! Thank you for confirming my suspicion that this issue was
related to Java. When I dug deeper I found multiple versions and some other
issues. I worked on it a while before deciding it would be easier to just
uninstall all Java and reinstall a clean JDK, and now it works perfectly.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Install-via-directions-in-Learning-Spark-Exception-when-running-bin-pyspark-tp25043p25049.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Problem installing Spark on Windows 8

2015-10-13 Thread Steve Loughran

On 12 Oct 2015, at 23:11, Marco Mistroni 
> wrote:

Hi all,
I have downloaded spark-1.5.1-bin-hadoop2.4.
I have extracted it on my machine, but when I go to the \bin directory and
invoke spark-shell I get the following exception.

Could anyone assist, please?


you've hit this https://wiki.apache.org/hadoop/WindowsProblems
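
The usual workaround, sketched below under the assumption that the winutils.exe
fix from that wiki page applies here, is to give Hadoop a local winutils
installation before Spark's Hive support initialises. For spark-shell itself,
the equivalent is setting the HADOOP_HOME environment variable (pointing at a
directory that contains bin\winutils.exe) before launching the shell.

// Hedged Scala sketch for a standalone application; C:\hadoop is only an
// example path, and winutils.exe must be downloaded separately.
System.setProperty("hadoop.home.dir", "C:\\hadoop") // expects C:\hadoop\bin\winutils.exe

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("windows-check").setMaster("local[*]"))
val hiveContext = new HiveContext(sc) // the step that hits the NPE when winutils is missing
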

I followed the instructions in the ebook Learning Spark, but maybe the
instructions are old?
kr
marco


15/10/12 23:10:29 WARN ObjectStore: Failed to get database default, returning No
SuchObjectException
15/10/12 23:10:30 WARN : Your hostname, MarcoLaptop resolves to a loopback/non-r
eachable address: fe80:0:0:0:c5ed:a66d:9d95:5caa%wlan2, but we couldn't find any
 external IP address!
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.jav
a:522)
at org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.s
cala:171)
at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveCo
ntext.scala:162)
at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala
:160)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:167)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruct
orAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingC
onstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:10
28)
at $iwC$$iwC.<init>(<console>:9)
at $iwC.<init>(<console>:18)
at <init>(<console>:20)
at .<init>(<console>:24)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
sorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:
1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:
1340)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840
)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:8
57)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.sca
la:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply
(SparkILoopInit.scala:132)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply
(SparkILoopInit.scala:124)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoop
Init.scala:124)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)

at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$Spark
ILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.s
cala:159)
at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkIL
oopInit.scala:108)
at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:
64)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$Spark
ILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$Spark
ILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$Spark
ILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClass
Loader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$pr
ocess(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
sorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 

Conf setting for Java Spark

2015-10-13 Thread Ramkumar V
Hi,

I'm using Java on Spark to process 30 GB of data every hour, submitting with
spark-submit in cluster mode. I have a cluster of 11 machines (9 with 64 GB of
memory and 2 with 32 GB), but it takes 30 minutes to process the 30 GB each
hour. How can I optimize this? How should I compute the driver and executor
memory from the machine configuration? I'm using the following Spark
configuration.

 sparkConf.setMaster("yarn-cluster");
 sparkConf.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer");
 sparkConf.set("spark.driver.memory", "2g");
 sparkConf.set("spark.executor.memory", "2g");
 sparkConf.set("spark.storage.memoryFraction", "0.5");
 sparkConf.set("spark.shuffle.memoryFraction", "0.4" );

*Thanks*,
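
As a starting point, and purely as illustrative assumptions rather than a tuned
recommendation, each executor can usually be given far more than 2 GB on 64 GB
nodes, together with explicit core and instance counts; a minimal Scala sketch
(the same .set(...) calls exist on the Java SparkConf):

import org.apache.spark.SparkConf

// Every number below is an assumption to be checked against the actual YARN
// container limits on the 32/64 GB nodes.
val sparkConf = new SparkConf()
  .setMaster("yarn-cluster")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.driver.memory", "4g")
  .set("spark.executor.memory", "8g")       // 2g per executor leaves most of a 64 GB node idle
  .set("spark.executor.cores", "4")         // cores per executor
  .set("spark.executor.instances", "18")    // roughly two executors per large node
  .set("spark.default.parallelism", "144")  // about two tasks per executor core
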



Re: Spark shuffle service does not work in stand alone

2015-10-13 Thread Marcelo Vanzin
It would probably be more helpful if you looked for the executor error and
posted it. The screenshot you posted is the driver exception caused by the
task failure, which is not terribly useful.

On Tue, Oct 13, 2015 at 7:23 AM,  wrote:

> Has anyone tried shuffle service in Stand Alone cluster mode? I want to
> enable it for d.a. but my jobs never start when I submit them.
> This happens with all my jobs.
>
>
> 15/10/13 08:29:45 INFO DAGScheduler: Job 0 failed: json at
> DataLoader.scala:86, took 16.318615 s
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent
> failure: Lost task 0.3 in stage 0.0 (TID 7, 162.101.194.47):
> ExecutorLostFailure (executor 4 lost)
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1942)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1003)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:985)
> at
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1114)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
> at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1091)
> at
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
> at
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
> at
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
> at scala.Option.getOrElse(Option.scala:120)
> at
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
> at
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
> at
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
> at
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
> at
> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:31)
> at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:120)
> at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
> at
> org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:219)
> at
> org.apache.saif.loaders.DataLoader$.load_json(DataLoader.scala:86)
>
>
>



-- 
Marcelo


Re: sql query orc slow

2015-10-13 Thread Zhan Zhang
Hi Patcharee,

I am not sure which side is wrong, driver or executor. If it is executor side, 
the reason you mentioned may be possible. But if the driver side didn’t set the 
predicate at all, then somewhere else is broken.

Can you please file a JIRA with a simple reproduce step, and let me know the 
JIRA number?

Thanks.

Zhan Zhang

On Oct 13, 2015, at 1:01 AM, Patcharee Thongtra 
> wrote:

Hi Zhan Zhang,

Could my problem (the ORC predicate is not generated from the WHERE clause even
though spark.sql.orc.filterPushdown=true) be related to some of the factors below?

- orc file version (File Version: 0.12 with HIVE_8732)
- hive version (using Hive 1.2.1.2.3.0.0-2557)
- orc table is not sorted / indexed
- the split strategy hive.exec.orc.split.strategy

BR,
Patcharee


On 10/09/2015 08:01 PM, Zhan Zhang wrote:
That is weird. Unfortunately, there is no debug info available on this part. 
Can you please open a JIRA to add some debug information on the driver side?

Thanks.

Zhan Zhang

On Oct 9, 2015, at 10:22 AM, patcharee 
<patcharee.thong...@uni.no>
 wrote:

I set hiveContext.setConf("spark.sql.orc.filterPushdown", "true"), but according
to the log, no ORC pushdown predicate is generated for my query with a WHERE clause.

15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate

I do not understand what is wrong here.

BR,
Patcharee

On 09. okt. 2015 19:10, Zhan Zhang wrote:
In your case, you manually set an AND pushdown, and the predicate is right
based on your setting: leaf-0 = (EQUALS x 320)

The right way is to enable the predicate pushdown as follows.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true”)

Thanks.

Zhan Zhang







On Oct 9, 2015, at 9:58 AM, patcharee 
<patcharee.thong...@uni.no>
 wrote:

Hi Zhan Zhang

Actually my query has a WHERE clause: "select date, month, year, hh, (u*0.9122461
- v*-0.40964267), (v*0.9122461 + u*-0.40964267), z from 4D where x = 320 and y
= 117 and zone == 2 and year=2009 and z >= 2 and z <= 8". Columns "x" and "y" are
not partition columns; the others are partition columns. I expected the system
to use predicate pushdown, but when I turned on debug logging I found that no
pushdown predicate was generated ("DEBUG OrcInputFormat: No ORC pushdown predicate").

Then I tried to set the search argument explicitly (on the column "x", which is
not a partition column):

   val xs = SearchArgumentFactory.newBuilder().startAnd().equals("x", 
320).end().build()
   hiveContext.setConf("hive.io.file.readcolumn.names", "x")
   hiveContext.setConf("sarg.pushdown", xs.toKryo())

This time the pushdown predicate was generated in the log, but the result was
wrong (no results at all):

15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (EQUALS 
x 320)
expr = leaf-0

Any ideas what is wrong with this? Why is the ORC pushdown predicate not applied
by the system?

BR,
Patcharee

On 09. okt. 2015 18:31, Zhan Zhang wrote:
Hi Patcharee,

From the query, it looks like only column pruning will be applied; partition
pruning and predicate pushdown have no effect. Do you see a big IO difference
between the two methods?

The potential reason for the speed difference that I can think of is the
different versions of OrcInputFormat: the Hive path may use NewOrcInputFormat,
but the Spark path uses OrcInputFormat.

Thanks.

Zhan Zhang

On Oct 8, 2015, at 11:55 PM, patcharee 
<patcharee.thong...@uni.no>
 wrote:

Yes, the predicate pushdown is enabled, but it still takes longer than the
first method.

BR,
Patcharee

On 08. okt. 2015 18:43, Zhan Zhang wrote:
Hi Patcharee,

Did you enable the predicate pushdown in the second method?

Thanks.

Zhan Zhang

On Oct 8, 2015, at 1:43 AM, patcharee 
<patcharee.thong...@uni.no>
 wrote:

Hi,

I am using Spark SQL 1.5 to query a Hive table stored as partitioned ORC files.
We have about 6000 files in total, and each file is about 245 MB.

What is the difference between these two query methods below:

1. Using query on hive table directly

hiveContext.sql("select col1, col2 from table1")

2. Reading from orc file, register temp table and query from the temp table

val c = hiveContext.read.format("orc").load("/apps/hive/warehouse/table1")
c.registerTempTable("regTable")
hiveContext.sql("select col1, col2 from regTable")

When the number of files is large (querying all 6000 files), the second case is
much slower than the first one. Any ideas why?

BR,




-
To unsubscribe, e-mail:  
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 

RE: Spark shuffle service does not work in stand alone

2015-10-13 Thread Saif.A.Ellafi
Hi, thanks

Executors are simply failing to connect to a shuffle server:

15/10/13 08:29:34 INFO BlockManagerMaster: Registered BlockManager
15/10/13 08:29:34 INFO BlockManager: Registering executor with local external 
shuffle service.
15/10/13 08:29:34 ERROR BlockManager: Failed to connect to external shuffle 
server, will retry 2 more times after waiting 5 seconds...
java.io.IOException: Failed to connect to /162.xxx.zzz.yy:port
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:140)
at 
org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:220)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:217)
at 
org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:203)
at org.apache.spark.executor.Executor.(Executor.scala:85)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:86)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at 
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at 
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at 
org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.net.ConnectException: Connection refused: /162.xxx.zzz.yy:port
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)


From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Tuesday, October 13, 2015 1:13 PM
To: Ellafi, Saif A.
Cc: user@spark.apache.org
Subject: Re: Spark shuffle service does not work in stand alone

It would probably be more helpful if you looked for the executor error and 
posted it. The screenshot you posted is the driver exception caused by the task 
failure, which is not terribly useful.

On Tue, Oct 13, 2015 at 7:23 AM, 
> wrote:
Has anyone tried 

RE: Spark shuffle service does not work in stand alone

2015-10-13 Thread Saif.A.Ellafi
I believe the confusion here is self-answered.
The thing is that in the documentation, the Spark shuffle service is described
only for YARN, while here we are speaking about a standalone cluster.

The proper question is: how do we launch a shuffle service for standalone mode?

Saif

From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com]
Sent: Tuesday, October 13, 2015 2:25 PM
To: van...@cloudera.com
Cc: user@spark.apache.org
Subject: RE: Spark shuffle service does not work in stand alone

Hi, thanks

Executors are simply failing to connect to a shuffle server:

15/10/13 08:29:34 INFO BlockManagerMaster: Registered BlockManager
15/10/13 08:29:34 INFO BlockManager: Registering executor with local external 
shuffle service.
15/10/13 08:29:34 ERROR BlockManager: Failed to connect to external shuffle 
server, will retry 2 more times after waiting 5 seconds...
java.io.IOException: Failed to connect to /162.xxx.zzz.yy:port
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:140)
at 
org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:220)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:217)
at 
org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:203)
at org.apache.spark.executor.Executor.(Executor.scala:85)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:86)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at 
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at 
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at 
org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.net.ConnectException: Connection refused: /162.xxx.zzz.yy:port
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)


From: Marcelo Vanzin 

Re: sql query orc slow

2015-10-13 Thread Patcharee Thongtra

Hi Zhan Zhang,

Here is the issue https://issues.apache.org/jira/browse/SPARK-11087

BR,
Patcharee

On 10/13/2015 06:47 PM, Zhan Zhang wrote:

Hi Patcharee,

I am not sure which side is wrong, driver or executor. If it is 
executor side, the reason you mentioned may be possible. But if the 
driver side didn’t set the predicate at all, then somewhere else is 
broken.


Can you please file a JIRA with a simple reproduce step, and let me 
know the JIRA number?


Thanks.

Zhan Zhang

On Oct 13, 2015, at 1:01 AM, Patcharee Thongtra 
> wrote:



Hi Zhan Zhang,

Is my problem (which is ORC predicate is not generated from WHERE 
clause even though spark.sql.orc.filterPushdown=true) can be related 
to some factors below ?


- orc file version (File Version: 0.12 with HIVE_8732)
- hive version (using Hive 1.2.1.2.3.0.0-2557)
- orc table is not sorted / indexed
- the split strategy hive.exec.orc.split.strategy

BR,
Patcharee


On 10/09/2015 08:01 PM, Zhan Zhang wrote:
That is weird. Unfortunately, there is no debug info available on 
this part. Can you please open a JIRA to add some debug information 
on the driver side?


Thanks.

Zhan Zhang

On Oct 9, 2015, at 10:22 AM, patcharee  
wrote:


I set hiveContext.setConf("spark.sql.orc.filterPushdown", "true"). 
But from the log No ORC pushdown predicate for my query with WHERE 
clause.


15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate

I did not understand what wrong with this.

BR,
Patcharee

On 09. okt. 2015 19:10, Zhan Zhang wrote:
In your case, you manually set an AND pushdown, and the predicate 
is right based on your setting, : leaf-0 = (EQUALS x 320)


The right way is to enable the predicate pushdown as follows.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true”)

Thanks.

Zhan Zhang







On Oct 9, 2015, at 9:58 AM, patcharee  
wrote:



Hi Zhan Zhang

Actually my query has a WHERE clause: "select date, month, year, hh, 
(u*0.9122461 - v*-0.40964267), (v*0.9122461 + u*-0.40964267), z 
from 4D where x = 320 and y = 117 and zone == 2 and year=2009 and 
z >= 2 and z <= 8". The columns "x" and "y" are not partition columns; 
the others are partition columns. I expected the system to use 
predicate pushdown, but when I turned on debug logging I found that no 
pushdown predicate was generated ("DEBUG OrcInputFormat: No ORC 
pushdown predicate").


Then I tried to set the search argument explicitly (on the column 
"x", which is not a partition column):


   val xs = 
SearchArgumentFactory.newBuilder().startAnd().equals("x", 
320).end().build()

   hiveContext.setConf("hive.io.file.readcolumn.names", "x")
   hiveContext.setConf("sarg.pushdown", xs.toKryo())

This time the pushdown predicate was generated in the log, but the 
results were wrong (no results at all):


15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: 
leaf-0 = (EQUALS x 320)

expr = leaf-0

Any ideas what is wrong with this? Why is the ORC pushdown predicate 
not applied by the system?


BR,
Patcharee

On 09. okt. 2015 18:31, Zhan Zhang wrote:

Hi Patcharee,

>From the query, it looks like only the column pruning will be 
applied. Partition pruning and predicate pushdown does not have 
effect. Do you see big IO difference between two methods?


The potential reason for the speed difference that I can think of is 
the different versions of OrcInputFormat: the Hive path may use 
NewOrcInputFormat, while the Spark path uses OrcInputFormat.


Thanks.

Zhan Zhang

On Oct 8, 2015, at 11:55 PM, patcharee 
 wrote:


Yes, predicate pushdown is enabled, but it still takes longer than 
the first method.


BR,
Patcharee

On 08. okt. 2015 18:43, Zhan Zhang wrote:

Hi Patcharee,

Did you enable the predicate pushdown in the second method?

Thanks.

Zhan Zhang

On Oct 8, 2015, at 1:43 AM, patcharee 
 wrote:



Hi,

I am using Spark SQL 1.5 to query a Hive table stored as partitioned 
ORC files. We have about 6000 files in total, and each file is about 
245MB.


What is the difference between these two query methods below:

1. Using query on hive table directly

hiveContext.sql("select col1, col2 from table1")

2. Reading from orc file, register temp table and query from 
the temp table


val c = 
hiveContext.read.format("orc").load("/apps/hive/warehouse/table1")

c.registerTempTable("regTable")
hiveContext.sql("select col1, col2 from regTable")

When the number of files is large (querying all 6000 files), the 
second case is much slower than the first one. Any ideas why?


BR,




-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org






















Re: Spark shuffle service does not work in stand alone

2015-10-13 Thread Marcelo Vanzin
You have to manually start the shuffle service if you're not running YARN.
See the "sbin/start-shuffle-service.sh" script.

On Tue, Oct 13, 2015 at 10:29 AM,  wrote:

> I believe the confusion here is self-answered.
>
> The thing is that in the documentation, the spark shuffle service runs
> only under YARN, while here we are speaking about a stand alone cluster.
>
>
>
> The proper question is, how to launch a shuffle service for stand alone?
>
>
>
> Saif
>
>
>
> *From:* saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com]
>
> *Sent:* Tuesday, October 13, 2015 2:25 PM
> *To:* van...@cloudera.com
> *Cc:* user@spark.apache.org
> *Subject:* RE: Spark shuffle service does not work in stand alone
>
>
>
> Hi, thanks
>
>
>
> Executors are simply failing to connect to a shuffle server:
>
>
>
> 15/10/13 08:29:34 INFO BlockManagerMaster: Registered BlockManager
>
> 15/10/13 08:29:34 INFO BlockManager: Registering executor with local
> external shuffle service.
>
> 15/10/13 08:29:34 ERROR BlockManager: Failed to connect to external
> shuffle server, will retry 2 more times after waiting 5 seconds...
>
> java.io.IOException: Failed to connect to /162.xxx.zzz.yy:port
>
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
>
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>
> at
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:140)
>
> at
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:220)
>
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>
> at
> org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:217)
>
> at
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:203)
>
> at org.apache.spark.executor.Executor.(Executor.scala:85)
>
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:86)
>
> at org.apache.spark.rpc.akka.AkkaRpcEnv.org
> $apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
>
> at
> org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
>
> at org.apache.spark.rpc.akka.AkkaRpcEnv.org
> $apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
>
> at
> org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
>
> at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>
> at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>
> at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>
> at
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>
> at
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>
> at
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>
> at
> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>
> at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>
> at
> org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
>
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>
> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>
> at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> Caused by: java.net.ConnectException: Connection refused:
> /162.xxx.zzz.yy:port
>
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>
> at
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>
> at
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>
> at
> 

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
I am currently dealing with a high priority bug in another project.

Hope to get back to this soon.

On Tue, Oct 13, 2015 at 11:56 AM, Umesh Kacha  wrote:

> Hi Ted sorry for asking again. Did you get chance to look at compilation
> issue? Thanks much.
>
> Regards.
> On Oct 13, 2015 18:39, "Umesh Kacha"  wrote:
>
>> Hi Ted I am using the following line of code I can't paste entire code
>> sorry but the following only line doesn't compile in my spark job
>>
>>  sourceframe.select(callUDF("percentile_approx",col("mycol"), lit(0.25)))
>>
>> I am using Intellij editor java and maven dependencies of spark core
>> spark sql spark hive version 1.5.1
>> On Oct 13, 2015 18:21, "Ted Yu"  wrote:
>>
>>> Can you pastebin your Java code and the command you used to compile ?
>>>
>>> Thanks
>>>
>>> On Oct 13, 2015, at 1:42 AM, Umesh Kacha  wrote:
>>>
>>> Hi Ted if fix went after 1.5.1 release then how come it's working with
>>> 1.5.1 binary in spark-shell.
>>> On Oct 13, 2015 1:32 PM, "Ted Yu"  wrote:
>>>
 Looks like the fix went in after 1.5.1 was released.

 You may verify using master branch build.

 Cheers

 On Oct 13, 2015, at 12:21 AM, Umesh Kacha 
 wrote:

 Hi Ted, thanks much I tried using percentile_approx in Spark-shell like
 you mentioned it works using 1.5.1 but it doesn't compile in Java using
 1.5.1 maven libraries it still complains same that callUdf can have string
 and column types only. Please guide.
 On Oct 13, 2015 12:34 AM, "Ted Yu"  wrote:

> SQL context available as sqlContext.
>
> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
> "value")
> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>
> scala> df.select(callUDF("percentile_approx",col("value"),
> lit(0.25))).show()
> +--+
> |'percentile_approx(value,0.25)|
> +--+
> |   1.0|
> +--+
>
> Can you upgrade to 1.5.1 ?
>
> Cheers
>
> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha 
> wrote:
>
>> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is
>> available in Spark 1.4.0 as per JAvadocx
>>
>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha 
>> wrote:
>>
>>> Hi Ted thanks much for the detailed answer and appreciate your
>>> efforts. Do we need to register Hive UDFs?
>>>
>>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>>
>>> I am calling Hive UDF percentile_approx in the following manner
>>> which gives compilation error
>>>
>>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>>> error
>>>
>>> //compile error because callUdf() takes String and Column* as
>>> arguments.
>>>
>>> Please guide. Thanks much.
>>>
>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu 
>>> wrote:
>>>
 Using spark-shell, I did the following exercise (master branch) :


 SQL context available as sqlContext.

 scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
 "value")
 df: org.apache.spark.sql.DataFrame = [id: string, value: int]

 scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) =>
 v * v + cnst)
 res0: org.apache.spark.sql.UserDefinedFunction =
 UserDefinedFunction(,IntegerType,List())

 scala> df.select($"id", callUDF("simpleUDF", $"value",
 lit(25))).show()
 +---++
 | id|'simpleUDF(value,25)|
 +---++
 |id1|  26|
 |id2|  41|
 |id3|  50|
 +---++

 Which Spark release are you using ?

 Can you pastebin the full stack trace where you got the error ?

 Cheers

 On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha 
 wrote:

> I have a doubt Michael I tried to use callUDF in  the following
> code it does not work.
>
>
> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>
> Above code does not compile because callUdf() takes only two
> arguments function name in String and Column class type. Please guide.
>
> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <
> umesh.ka...@gmail.com> wrote:
>
>> thanks much Michael let me try.
>>
>> On Sat, Oct 10, 2015 at 1:20 AM, Michael 

TTL for saveAsObjectFile()

2015-10-13 Thread antoniosi
Hi,

I am using RDD.saveAsObjectFile() to save the RDD dataset to Tachyon. In
version 0.8, Tachyon will support TTL for saved files. Is that supported from
Spark as well? Is there a way I could specify a TTL for a saved object file?

Thanks.

Antonio.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/TTL-for-saveAsObjectFile-tp25051.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark DataFrame GroupBy into List

2015-10-13 Thread Michael Armbrust
import org.apache.spark.sql.functions._

df.groupBy("category")
  .agg(callUDF("collect_set", df("id")).as("id_list"))

On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu  wrote:

> Hey Spark users,
>
> I'm trying to group by a dataframe, by appending occurrences into a list
> instead of count.
>
> Let's say we have a dataframe as shown below:
>
> | category | id |
> |  |:--:|
> | A| 1  |
> | A| 2  |
> | B| 3  |
> | B| 4  |
> | C| 5  |
>
> ideally, after some magic group by (reverse explode?):
>
> | category | id_list  |
> |  |  |
> | A| 1,2  |
> | B| 3,4  |
> | C| 5|
>
> any tricks to achieve that? Scala Spark API is preferred. =D
>
> BR,
> Todd Leo
>
>
>
>


RE: Spark shuffle service does not work in stand alone

2015-10-13 Thread Saif.A.Ellafi
Thanks, I missed that one.

From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Tuesday, October 13, 2015 2:36 PM
To: Ellafi, Saif A.
Cc: user@spark.apache.org
Subject: Re: Spark shuffle service does not work in stand alone

You have to manually start the shuffle service if you're not running YARN. See 
the "sbin/start-shuffle-service.sh" script.

On Tue, Oct 13, 2015 at 10:29 AM, 
> wrote:
I believe the confusion here is self-answered.
The thing is that in the documentation, the spark shuffle service runs only 
under YARN, while here we are speaking about a stand alone cluster.

The proper question is, how to launch a shuffle service for stand alone?

Saif

From: saif.a.ell...@wellsfargo.com 
[mailto:saif.a.ell...@wellsfargo.com]
Sent: Tuesday, October 13, 2015 2:25 PM
To: van...@cloudera.com
Cc: user@spark.apache.org
Subject: RE: Spark shuffle service does not work in stand alone

Hi, thanks

Executors are simply failing to connect to a shuffle server:

15/10/13 08:29:34 INFO BlockManagerMaster: Registered BlockManager
15/10/13 08:29:34 INFO BlockManager: Registering executor with local external 
shuffle service.
15/10/13 08:29:34 ERROR BlockManager: Failed to connect to external shuffle 
server, will retry 2 more times after waiting 5 seconds...
java.io.IOException: Failed to connect to /162.xxx.zzz.yy:port
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:140)
at 
org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:220)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:217)
at 
org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:203)
at org.apache.spark.executor.Executor.(Executor.scala:85)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:86)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at 
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at 
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at 
org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.net.ConnectException: Connection refused: /162.xxx.zzz.yy:port
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 

Generated ORC files cause NPE in Hive

2015-10-13 Thread Daniel Haviv
Hi,
We are inserting streaming data into a hive orc table via a simple insert
statement passed to HiveContext.
When trying to read the files generated using Hive 1.2.1 we are getting NPE:
at
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
at
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
at
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:290)
at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
... 14 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime
Error while processing row
at
org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:52)
at
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:83)
... 17 more
Caused by: java.lang.NullPointerException
at java.lang.System.arraycopy(Native Method)
at org.apache.hadoop.io.Text.set(Text.java:225)
at
org.apache.hadoop.hive.ql.exec.vector.VectorExtractRow$StringExtractorByValue.extract(VectorExtractRow.java:472)
at
org.apache.hadoop.hive.ql.exec.vector.VectorExtractRow.extractRow(VectorExtractRow.java:732)
at
org.apache.hadoop.hive.ql.exec.vector.VectorReduceSinkOperator.process(VectorReduceSinkOperator.java:102)
at
org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at
org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:138)
at
org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at
org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
at
org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:162)
at
org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:45)
... 18 more

Is this a known issue ?


Re: Generated ORC files cause NPE in Hive

2015-10-13 Thread Alexander Pivovarov
Daniel,

Looks like we already have Jira for that error
https://issues.apache.org/jira/browse/HIVE-11431

Could you put details on how to reproduce the issue to the ticket?

Thank you
Alex

On Tue, Oct 13, 2015 at 11:14 AM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:

> Hi,
> We are inserting streaming data into a hive orc table via a simple insert
> statement passed to HiveContext.
> When trying to read the files generated using Hive 1.2.1 we are getting
> NPE:
> at
> org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
> at
> org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
> at
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:290)
> at
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
> ... 14 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime
> Error while processing row
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:52)
> at
> org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:83)
> ... 17 more
> Caused by: java.lang.NullPointerException
> at java.lang.System.arraycopy(Native Method)
> at org.apache.hadoop.io.Text.set(Text.java:225)
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorExtractRow$StringExtractorByValue.extract(VectorExtractRow.java:472)
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorExtractRow.extractRow(VectorExtractRow.java:732)
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorReduceSinkOperator.process(VectorReduceSinkOperator.java:102)
> at
> org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:138)
> at
> org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
> at
> org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
> at
> org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:162)
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:45)
> ... 18 more
>
> Is this a known issue ?
>
>


Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
Hi Ted, sorry for asking again. Did you get a chance to look at the
compilation issue? Thanks much.

Regards.
On Oct 13, 2015 18:39, "Umesh Kacha"  wrote:

> Hi Ted I am using the following line of code I can't paste entire code
> sorry but the following only line doesn't compile in my spark job
>
>  sourceframe.select(callUDF("percentile_approx",col("mycol"), lit(0.25)))
>
> I am using Intellij editor java and maven dependencies of spark core spark
> sql spark hive version 1.5.1
> On Oct 13, 2015 18:21, "Ted Yu"  wrote:
>
>> Can you pastebin your Java code and the command you used to compile ?
>>
>> Thanks
>>
>> On Oct 13, 2015, at 1:42 AM, Umesh Kacha  wrote:
>>
>> Hi Ted if fix went after 1.5.1 release then how come it's working with
>> 1.5.1 binary in spark-shell.
>> On Oct 13, 2015 1:32 PM, "Ted Yu"  wrote:
>>
>>> Looks like the fix went in after 1.5.1 was released.
>>>
>>> You may verify using master branch build.
>>>
>>> Cheers
>>>
>>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha  wrote:
>>>
>>> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like
>>> you mentioned it works using 1.5.1 but it doesn't compile in Java using
>>> 1.5.1 maven libraries it still complains same that callUdf can have string
>>> and column types only. Please guide.
>>> On Oct 13, 2015 12:34 AM, "Ted Yu"  wrote:
>>>
 SQL context available as sqlContext.

 scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
 "value")
 df: org.apache.spark.sql.DataFrame = [id: string, value: int]

 scala> df.select(callUDF("percentile_approx",col("value"),
 lit(0.25))).show()
 +--+
 |'percentile_approx(value,0.25)|
 +--+
 |   1.0|
 +--+

 Can you upgrade to 1.5.1 ?

 Cheers

 On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha 
 wrote:

> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is
> available in Spark 1.4.0 as per JAvadocx
>
> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha 
> wrote:
>
>> Hi Ted thanks much for the detailed answer and appreciate your
>> efforts. Do we need to register Hive UDFs?
>>
>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>
>> I am calling Hive UDF percentile_approx in the following manner which
>> gives compilation error
>>
>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>> error
>>
>> //compile error because callUdf() takes String and Column* as
>> arguments.
>>
>> Please guide. Thanks much.
>>
>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu  wrote:
>>
>>> Using spark-shell, I did the following exercise (master branch) :
>>>
>>>
>>> SQL context available as sqlContext.
>>>
>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>> "value")
>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>
>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v
>>> * v + cnst)
>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>> UserDefinedFunction(,IntegerType,List())
>>>
>>> scala> df.select($"id", callUDF("simpleUDF", $"value",
>>> lit(25))).show()
>>> +---++
>>> | id|'simpleUDF(value,25)|
>>> +---++
>>> |id1|  26|
>>> |id2|  41|
>>> |id3|  50|
>>> +---++
>>>
>>> Which Spark release are you using ?
>>>
>>> Can you pastebin the full stack trace where you got the error ?
>>>
>>> Cheers
>>>
>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha 
>>> wrote:
>>>
 I have a doubt Michael I tried to use callUDF in  the following
 code it does not work.

 sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))

 Above code does not compile because callUdf() takes only two
 arguments function name in String and Column class type. Please guide.

 On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha  wrote:

> thanks much Michael let me try.
>
> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
> mich...@databricks.com> wrote:
>
>> This is confusing because I made a typo...
>>
>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>
>> The first argument is the name of the UDF, all other arguments
>> need to be columns that are passed in as 

Changing application log level in standalone cluster

2015-10-13 Thread Tom Graves
I would like to change the logging level for my application running on a 
standalone Spark cluster.  Is there an easy way to do that  without changing 
the log4j.properties on each individual node?
Thanks,
Tom
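
One possible starting point, assuming Spark 1.4 or later: SparkContext.setLogLevel
changes the level at runtime, though only for the JVM it is called in (the
driver), so executor logging may still follow each node's log4j.properties.

// Sketch, assuming Spark 1.4+ and that sc is the application's SparkContext.
// Adjusts the driver-side log level at runtime without editing
// log4j.properties on the cluster nodes.
sc.setLogLevel("WARN")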

Machine learning with spark (book code example error)

2015-10-13 Thread Zsombor Egyed
Hi!

I was reading the Machine Learning with Spark book and was very interested
in chapter 9 (text mining), so I tried the code examples.

Everything was fine, but in this line:

val testLabels = testRDD.map {

case (file, text) => val topic = file.split("/").takeRight(2).head

newsgroupsMap(topic) }

I got an error: "value newsgroupsMap is not a member of String"

Other relevant part of the code:
val path = "/PATH/20news-bydate-train/*"
val rdd = sc.wholeTextFiles(path)
val newsgroups = rdd.map { case (file, text) =>
file.split("/").takeRight(2).head }

val tf = hashingTF.transform(tokens)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)

val newsgroupsMap = newsgroups.distinct.collect().zipWithIndex.toMap
val zipped = newsgroups.zip(tfidf)
val train = zipped.map { case (topic, vector)
=>LabeledPoint(newsgroupsMap(topic), vector) }
train.cache

val model = NaiveBayes.train(train, lambda = 0.1)

val testPath = "/PATH//20news-bydate-test/*"
val testRDD = sc.wholeTextFiles(testPath)
val testLabels = testRDD.map { case (file, text) => val topic =
file.split("/").takeRight(2).head newsgroupsMap(topic) }

I attached the whole program code.
Can anyone help me figure out what the problem is?

Regards,
Zsombor


scala-shell-code_09.scala
Description: Binary data

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Fwd: Problem about cannot open shared object file

2015-10-13 Thread 赵夏
Hello Everyone:
  I am new to Spark. I have run into a problem which I cannot solve by
searching Google.
  I have run the Pi example on Hadoop 2.6 successfully.
  After setting up the Spark platform and trying to run Pi using
"./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
yarn-cluster  --num-executors 2  lib/spark-examples*.jar 10", some problems
occurred.

I have listed the error log (found in the userlog directory on the
slave node's Hadoop directory) below and marked the possible cause in red.
However, I really do not know how to solve it.



Many thanks
Xia



136 15/10/13 21:00:19 WARN scheduler.TaskSetManager: Lost task 0.0 in
stage 0.0 (TID 0, slave4): java.io.IOException:
java.lang.reflect.InvocationTargetException
137 at
org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1178)
138 at
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
139 at
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
140 at
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
141 at
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
142 at
org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
143 at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
144 at org.apache.spark.scheduler.Task.run(Task.scala:88)
145 at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
146 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
147 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
148 at java.lang.Thread.run(Thread.java:745)
149 Caused by: java.lang.reflect.InvocationTargetException
150 at
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
151 at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
152 at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
153 at
java.lang.reflect.Constructor.newInstance(Constructor.java:408)
154 at
org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:67)
155 at
org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
156 at org.apache.spark.broadcast.TorrentBroadcast.org
$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
157 at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:167)
158 at
org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1175)
159 ... 11 more
160 Caused by: java.lang.IllegalArgumentException:
java.lang.UnsatisfiedLinkError:
/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1444764656424_0003/container_1444764656424_0003_02_03/tmp/snappy-1.0.4.1-43a64f64-fd65-4f5d-82d0-275eacbd4420-libsnappyjava.so:
/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1444764656424_0003/container_1444764656424_0003_02_03/tmp/snappy-1.0.4.1-43a64f64-fd65-4f5d-82d0-275eacbd4420-libsnappyjava.so:
cannot open shared object file: No such file or directory
161 at
org.apache.spark.io.SnappyCompressionCodec.(CompressionCodec.scala:151)
162 ... 20 more
163 Caused by: java.lang.UnsatisfiedLinkError:
/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1444764656424_0003/container_1444764656424_0003_02_03/tmp/
snappy-1.0.4.1-43a64f64-fd65-4f5d-82d0-275eacbd4420-libsnappyjava.so:
/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1444764656424_0003/container_1444764656424_0003_02_03/tmp/
snappy-1.0.4.1-43a64f64-fd65-4f5d-82d0-275eacbd4420-libsnappyjava.so: cannot
open shared object file: No such file or directory
164 at java.lang.ClassLoader$NativeLibrary.load(Native Method)
165 at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1937)
166 at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1822)
167 at java.lang.Runtime.load0(Runtime.java:809)
168 at java.lang.System.load(System.java:1083)
169 at
org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:166)
170 at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:145)
171 at org.xerial.snappy.Snappy.(Snappy.java:47)
172 at
org.apache.spark.io.SnappyCompressionCodec.(CompressionCodec.scala:149)
173 ... 20 more
174
175 15/10/13 21:00:19 INFO scheduler.TaskSetManager: Lost task 1.0 in
stage 0.0 (TID 1) on executor slave4: java.io.IOException
(java.lang.reflect.InvocationTargetException) [duplicate 1]
176 15/10/13 21:00:19 INFO scheduler.TaskSetManager: Starting task 1.1
in stage 0.0 (TID 3, slave4, PROCESS_LOCAL, 2085 bytes)
177 15/10/13 21:00:19 INFO 

Spark 1.5 java.net.ConnectException: Connection refused

2015-10-13 Thread Spark Newbie
Hi Spark users,

I'm seeing the below exception in my spark streaming application. It
happens in the first stage where the kinesis receivers receive records and
perform a flatMap operation on the unioned Dstream. A coalesce step also
happens as a part of that stage for optimizing the performance.

This is happening on my Spark 1.5 instance using kinesis-asl-1.5. When I
look at the executor logs I do not see any exceptions indicating the root
cause of why there is no connectivity on xxx.xx.xx.xxx:36684, or when that
service went down.

Any help debugging this problem will be helpful.

15/10/13 16:36:07 ERROR shuffle.RetryingBlockFetcher: Exception while
beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to
ip-xxx-xx-xx-xxx.ec2.internal/xxx.xx.xx.xxx:36684
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
at
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at
org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
at
org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:97)
at
org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:89)
at
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
at
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:593)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:593)
at
org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:579)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:623)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:139)
at
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:135)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.immutable.List.foreach(List.scala:318)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:135)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused:
ip-xxx-xx-xx-xxx.ec2.internal/xxx.xx.xx.xxx:36684

Thanks,
Bharath


Any plans to support Spark Streaming within an interactive shell?

2015-10-13 Thread YaoPau
I'm seeing products that allow you to interact with a stream in realtime
(write code, and see the streaming output automatically change), which I
think makes it easier to test streaming code, although running it on batch
then turning streaming on certainly is a good way as well.

I played around with trying to get Spark Streaming working within the spark
shell, but it looks like as soon as I stop my ssc I can't restart it again. 
Are there any open tickets to add this functionality?  Just looking to see
if it's on the roadmap at all.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Any-plans-to-support-Spark-Streaming-within-an-interactive-shell-tp25053.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SPARK SQL Error

2015-10-13 Thread pnpritchard
Your app jar should be at the end of the command, without the --jars prefix.
That option is only necessary if you have more than one jar to put on the
classpath (i.e. dependency jars that aren't packaged inside your app jar).

spark-submit --master yarn --class org.spark.apache.CsvDataSource --files
hdfs:///people_csv /home/cloudera/Desktop/TestMain.jar



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-SQL-Error-tp25050p25052.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Announcement: Hackathon at Netherlands Cancer Institute next week

2015-10-13 Thread Kees van Bochove
Dear all,

I'd like to point out that on 19-21 October we have a hackathon at the
Netherlands Cancer Institute in Amsterdam, which is about connecting a
major open source bioinformatics / medical informatics research data
warehouse (called tranSMART) to SparkR, by implementing the RDD
interface via the core APIs of tranSMART.

I think this is a nice example of a crossover between two mostly disjoint
open source communities, in this case by medical researchers seeking to
leverage available components around Spark such as SparkR. Anyone with
Spark experience would be most welcome to drop by! Details at
http://lanyrd.com/2015/transmart-foundation-annual-meeting/sdrzrq/

Met vriendelijke groet / Kind regards,

Kees van Bochove


Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
I modified DataFrameSuite, in master branch, to call percentile_approx
instead of simpleUDF :

- deprecated callUdf in SQLContext
- callUDF in SQLContext *** FAILED ***
  org.apache.spark.sql.AnalysisException: undefined function
percentile_approx;
  at
org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
  at
org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
  at scala.Option.getOrElse(Option.scala:120)
  at
org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:63)
  at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
  at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
  at
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
  at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
  at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
  at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)

SPARK-10671 is included.
For 1.5.1, I guess the absence of SPARK-10671 means that SparkSQL
treats percentile_approx as a normal UDF.

Experts can correct me, if there is any misunderstanding.
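
One cross-version workaround worth noting (a sketch only, not verified in
this thread): go through HiveContext SQL, where percentile_approx is resolved
as a Hive UDAF regardless of which callUDF/callUdf overloads the Scala or
Java API exposes. The table name "t" and column name "value" below are
placeholders, not taken from the original job.

// Sketch, assuming a HiveContext and Spark 1.4/1.5.
df.registerTempTable("t")
val quartile = hiveContext.sql(
  "SELECT percentile_approx(value, 0.25) AS p25 FROM t")
quartile.show()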

Cheers

On Tue, Oct 13, 2015 at 6:09 AM, Umesh Kacha  wrote:

> Hi Ted I am using the following line of code I can't paste entire code
> sorry but the following only line doesn't compile in my spark job
>
>  sourceframe.select(callUDF("percentile_approx",col("mycol"), lit(0.25)))
>
> I am using Intellij editor java and maven dependencies of spark core spark
> sql spark hive version 1.5.1
> On Oct 13, 2015 18:21, "Ted Yu"  wrote:
>
>> Can you pastebin your Java code and the command you used to compile ?
>>
>> Thanks
>>
>> On Oct 13, 2015, at 1:42 AM, Umesh Kacha  wrote:
>>
>> Hi Ted if fix went after 1.5.1 release then how come it's working with
>> 1.5.1 binary in spark-shell.
>> On Oct 13, 2015 1:32 PM, "Ted Yu"  wrote:
>>
>>> Looks like the fix went in after 1.5.1 was released.
>>>
>>> You may verify using master branch build.
>>>
>>> Cheers
>>>
>>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha  wrote:
>>>
>>> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like
>>> you mentioned it works using 1.5.1 but it doesn't compile in Java using
>>> 1.5.1 maven libraries it still complains same that callUdf can have string
>>> and column types only. Please guide.
>>> On Oct 13, 2015 12:34 AM, "Ted Yu"  wrote:
>>>
 SQL context available as sqlContext.

 scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
 "value")
 df: org.apache.spark.sql.DataFrame = [id: string, value: int]

 scala> df.select(callUDF("percentile_approx",col("value"),
 lit(0.25))).show()
 +--+
 |'percentile_approx(value,0.25)|
 +--+
 |   1.0|
 +--+

 Can you upgrade to 1.5.1 ?

 Cheers

 On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha 
 wrote:

> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is
> available in Spark 1.4.0 as per JAvadocx
>
> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha 
> wrote:
>
>> Hi Ted thanks much for the detailed answer and appreciate your
>> efforts. Do we need to register Hive UDFs?
>>
>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>
>> I am calling Hive UDF percentile_approx in the following manner which
>> gives compilation error
>>
>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>> error
>>
>> //compile error because callUdf() takes String and Column* as
>> arguments.
>>
>> Please guide. Thanks much.
>>
>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu  wrote:
>>
>>> Using spark-shell, I did the following exercise (master branch) :
>>>
>>>
>>> SQL context available as sqlContext.
>>>
>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>> "value")
>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>
>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v
>>> * v + cnst)
>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>> 

Re: Spark 1.5 java.net.ConnectException: Connection refused

2015-10-13 Thread Tathagata Das
Is this happening too often? Is it slowing things down or blocking
progress? Failures once in a while are part of the norm, and the system
should take care of itself.

On Tue, Oct 13, 2015 at 2:47 PM, Spark Newbie 
wrote:

> Hi Spark users,
>
> I'm seeing the below exception in my spark streaming application. It
> happens in the first stage where the kinesis receivers receive records and
> perform a flatMap operation on the unioned Dstream. A coalesce step also
> happens as a part of that stage for optimizing the performance.
>
> This is happening on my spark 1.5 instance using kinesis-asl-1.5. When I
> look at the executor logs I do not see any exceptions indicating the root
> cause of why there is no connectivity on xxx.xx.xx.xxx:36684 or when did
> that service go down.
>
> Any help debugging this problem will be helpful.
>
> 15/10/13 16:36:07 ERROR shuffle.RetryingBlockFetcher: Exception while
> beginning fetch of 1 outstanding blocks
> java.io.IOException: Failed to connect to
> ip-xxx-xx-xx-xxx.ec2.internal/xxx.xx.xx.xxx:36684
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
> at
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
> at
> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:97)
> at
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:89)
> at
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
> at
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:593)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:593)
> at
> org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:579)
> at
> org.apache.spark.storage.BlockManager.get(BlockManager.scala:623)
> at
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> at
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:139)
> at
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:135)
> at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:135)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused:
> 

Re: updateStateByKey and stack overflow

2015-10-13 Thread Tian Zhang
It turns out that our HDFS checkpoint failed, but Spark Streaming kept
running and building up a long lineage ...



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/updateStateByKey-and-stack-overflow-tp25015p25054.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Any plans to support Spark Streaming within an interactive shell?

2015-10-13 Thread Tathagata Das
StreamingContext is not designed to be reused, and that is not on the
immediate roadmap.
You have to create a new StreamingContext.
Also, you should stop the StreamingContext without stopping the SparkContext
of the shell.
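
Concretely, a minimal sketch of what that looks like in the shell, assuming
ssc is the running StreamingContext and sc is the shell's SparkContext:

// Stop only the streaming side; keep the shell's SparkContext alive.
ssc.stop(stopSparkContext = false)

// A stopped StreamingContext cannot be restarted, so build a fresh one.
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc2 = new StreamingContext(sc, Seconds(10))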

On Tue, Oct 13, 2015 at 1:47 PM, YaoPau  wrote:

> I'm seeing products that allow you to interact with a stream in realtime
> (write code, and see the streaming output automatically change), which I
> think makes it easier to test streaming code, although running it on batch
> then turning streaming on certainly is a good way as well.
>
> I played around with trying to get Spark Streaming working within the spark
> shell, but it looks like as soon as I stop my ssc I can't restart it again.
> Are there any open tickets to add this functionality?  Just looking to see
> if it's on the roadmap at all.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Any-plans-to-support-Spark-Streaming-within-an-interactive-shell-tp25053.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Building with SBT and Scala 2.11

2015-10-13 Thread Jakob Odersky
I'm having trouble compiling Spark with SBT for Scala 2.11. The command I
use is:

dev/change-version-to-2.11.sh
build/sbt -Pyarn -Phadoop-2.11 -Dscala-2.11

followed by

compile

in the sbt shell.

The error I get specifically is:

spark/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
no valid targets for annotation on value conf - it is discarded unused. You
may specify targets with meta-annotations, e.g. @(transient @param)
[error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
[error]

However I am also getting a large amount of deprecation warnings, making me
wonder if I am supplying some incompatible/unsupported options to sbt. I am
using Java 1.8 and the latest Spark master sources.
Does someone know if I am doing anything wrong or is the sbt build broken?

thanks for your help,
--Jakob


Re: Building with SBT and Scala 2.11

2015-10-13 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtY7aX22B44dB

On Tue, Oct 13, 2015 at 5:53 PM, Jakob Odersky  wrote:

> I'm having trouble compiling Spark with SBT for Scala 2.11. The command I
> use is:
>
> dev/change-version-to-2.11.sh
> build/sbt -Pyarn -Phadoop-2.11 -Dscala-2.11
>
> followed by
>
> compile
>
> in the sbt shell.
>
> The error I get specifically is:
>
> spark/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
> no valid targets for annotation on value conf - it is discarded unused. You
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] private[netty] class NettyRpcEndpointRef(@transient conf:
> SparkConf)
> [error]
>
> However I am also getting a large amount of deprecation warnings, making
> me wonder if I am supplying some incompatible/unsupported options to sbt. I
> am using Java 1.8 and the latest Spark master sources.
> Does someone know if I am doing anything wrong or is the sbt build broken?
>
> thanks for you help,
> --Jakob
>
>


Fwd: [Streaming] join events in last 10 minutes

2015-10-13 Thread Daniel Li
re-post to the right group.

-- Forwarded message --
From: Daniel Li 
Date: Tue, Oct 13, 2015 at 5:14 PM
Subject: [Streaming] join events in last 10 minutes
To: d...@spark.apache.org


We have a scenario where events from three Kafka topics sharing the same
keys need to be merged. One topic has the master events; most events in the
other two topics arrive within 10 minutes of the master event's arrival. I
wrote pseudo code below and would love to hear your thoughts on whether I am
on the right track.

// Scenario
// (1) Merging events from Kafka topic1, topic2 and topic 3 sharing
the same keys
// (2) Events in topic1 are master events
// (3) One master event may have associated event in Topic2 and/or
Topic3 sharing the same key
// (4) Most events in topic2 and topic3 will arrive within 10
minutes of the master event arrival
//
// Pseudo code
// Use 1-minute window of events in topic1, to left-outer-join with
next 10-minute of events from
// topic2 and topic3


// parse the event to form key-value pair
def parse(v:String) = {
(v.split(",")(0), v)
}

// Create context with 1 minute batch interval
val sparkConf = new SparkConf().setAppName("MergeLogs")
val ssc = new StreamingContext(sparkConf, Minutes(1))
ssc.checkpoint(checkpointDirectory)

// Create direct kafka stream with brokers and topics
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)

val stream1 = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topic1"))
stream1.checkpoint(Minutes(5))
val pairStream1 = stream1.map(_._2).map(s => parse(s))

val stream2 = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topic2"))
stream2.checkpoint(Minutes(5))
val pairStream2 = stream2.map(_._2).map(s => parse(s))

val stream3 = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topic3"))
stream3.checkpoint(Minutes(5))
val pairStream3 = stream3.map(_._2).map(s => parse(s))

// load 1 minute of master events from topic 1
val windowedStream1 = pairStream1.window(Minutes(1))

// load 10 minutes of topic1 and topic2
val windowedStream2 = pairStream2.window(Minutes(10), Minutes(1))
val windowedStream3 = pairStream3.window(Minutes(10), Minutes(1))

// lefter join topic1 with topic2 and topic3
*val joinedStream =
windowedStream1.leftOuterJoin(windowedStream2).leftOuterJoin(windowedStream3)*

// dump merged events
joinedStream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver
  rdd.foreach { record =>
    connection.send(record) // executed at the worker
  }
}

// Start the computation
val ssc = StreamingContext.getOrCreate(checkpointDirectory,
  () => {
createContext(ip, port, outputPath, checkpointDirectory)
  })
ssc.start()
ssc.awaitTermination()

thx
Daniel


Re: Cannot get spark-streaming_2.10-1.5.0.pom from the maven repository

2015-10-13 Thread Ted Yu
Still 404 as of a moment ago.

On Mon, Oct 12, 2015 at 9:04 PM, Ted Yu  wrote:

> I checked commit history of streaming/pom.xml
>
> There should be no difference between 1.5.0 and 1.5.1
>
> You can download 1.5.1's pom.xml and rename it so that you get unblocked.
>
> On Mon, Oct 12, 2015 at 8:50 PM, Keiji Yoshida  > wrote:
>
>> Thanks for the reply.
>>
>> Yes, I know 1.5.1 is available but I need to use 1.5.0 because I need to
>> run Spark applications on Cloud Dataproc (
>> https://cloud.google.com/dataproc/ ) which supports only 1.5.0.
>>
>> On Tue, Oct 13, 2015 at 12:13 PM, Ted Yu  wrote:
>>
>>> I got 404 as well.
>>>
>>> BTW 1.5.1 has been released.
>>> I was able to access:
>>>
>>> http://central.maven.org/maven2/org/apache/spark/spark-streaming_2.10/1.5.1/spark-streaming_2.10-1.5.1.pom
>>>
>>> FYI
>>>
>>> On Mon, Oct 12, 2015 at 8:09 PM, y  wrote:
>>>
 When I access the following URL, I often get a 404 error and I cannot
 get the
 POM file of "spark-streaming_2.10-1.5.0.pom".


 http://central.maven.org/maven2/org/apache/spark/spark-streaming_2.10/1.5.0/spark-streaming_2.10-1.5.0.pom

 Are there any problems inside the maven repository? Are there any way
 to get
 around the problem and get the POM file for sure?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-get-spark-streaming-2-10-1-5-0-pom-from-the-maven-repository-tp25042.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Spark DataFrame GroupBy into List

2015-10-13 Thread SLiZn Liu
Hey Spark users,

I'm trying to group by a dataframe, by appending occurrences into a list
instead of count.

Let's say we have a dataframe as shown below:

| category | id |
|  |:--:|
| A| 1  |
| A| 2  |
| B| 3  |
| B| 4  |
| C| 5  |

ideally, after some magic group by (reverse explode?):

| category | id_list  |
|  |  |
| A| 1,2  |
| B| 3,4  |
| C| 5|

any tricks to achieve that? Scala Spark API is preferred. =D

BR,
Todd Leo