Fwd: Error:scalac: Error: assertion failed: List(object package$DebugNode, object package$DebugNode)

2015-12-30 Thread zml张明磊
I'm sorry. The error does not occur when I build Spark; it happens when running the example LogisticRegressionWithElasticNetExample.scala. From: zml张明磊 [mailto:mingleizh...@ctrip.com] Sent: December 31, 2015 15:01 To: user@spark.apache.org Subject: Error:scalac: Error: assertion failed: List(object

Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-30 Thread Yanbo Liang
Hi Jia, You can try to use inputRDD.persist(MEMORY_AND_DISK) and verify whether it can produce stable performance. The MEMORY_AND_DISK storage level will store the partitions that don't fit in memory on disk and read them from there when they are needed. Actually, it's not necessary to set so large
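
A minimal sketch of the suggested call, assuming a standalone driver program (the app name and input path here are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("persist-sketch"))

// Hypothetical input location; replace with the real data path.
val inputRDD = sc.textFile("hdfs:///path/to/input")

// MEMORY_AND_DISK keeps partitions in memory and spills the ones that
// don't fit to local disk, re-reading them from there when needed.
inputRDD.persist(StorageLevel.MEMORY_AND_DISK)

// The first action materializes the cache; later actions reuse it.
println(inputRDD.count())
```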

K means clustering in spark

2015-12-30 Thread anjali . gautam09
Hi, I am trying to use kmeans for clustering in Spark using Python. I implemented it on the sample data set that ships with Spark; it's a 3*4 matrix. Can anybody please help me with what the orientation of the data for kmeans should be? Also, how do I find out what the clusters and their members are? Thanks
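
For reference, the layout KMeans expects is one row per point and one column per feature. The thread uses Python, but the orientation is the same in either API; a small sketch against the Scala MLlib API with made-up data:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(new SparkConf().setAppName("kmeans-sketch"))

// Each Vector is one observation (a row); its entries are the features.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0, 0.0, 0.1),
  Vectors.dense(0.1, 0.1, 0.1, 0.0),
  Vectors.dense(9.0, 9.1, 9.0, 9.2)
))

val model = KMeans.train(points, /* k = */ 2, /* maxIterations = */ 20)

// Cluster centers, and the cluster each point was assigned to.
model.clusterCenters.foreach(println)
points.map(p => (model.predict(p), p)).collect().foreach(println)
```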

Re: Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-30 Thread SparkUser
Sounds like you guys are on the right track. This is purely FYI because I haven't seen it posted; I'm just responding to the line in the original post that your data structure should fit in memory. OK, two more disclaimers, "FWIW" and "maybe this is not relevant or already covered". OK, here goes...

SparkSQL integration issue with AWS S3a

2015-12-30 Thread KOSTIANTYN Kudriavtsev
Dear Spark community, I faced the following issue when trying to access data on S3a; my code is the following: val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
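
For anyone landing here, a hedged sketch of passing S3a credentials through hadoopConfiguration, following the fragment above. The bucket, path and env-var plumbing are illustrative, and the property names depend on which S3A build is on the classpath:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("s3a-sketch"))
val sqlContext = new SQLContext(sc)

val hc = sc.hadoopConfiguration
hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
// Property names differ across S3A builds: the Hadoop-bundled client uses
// fs.s3a.access.key / fs.s3a.secret.key, while the standalone hadoop-s3a
// project uses fs.s3a.awsAccessKeyId / fs.s3a.awsSecretAccessKey.
hc.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hc.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// Hypothetical bucket and path, just to show the URI scheme.
val df = sqlContext.read.parquet("s3a://some-bucket/some/path")
df.show()
```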

How to register a Tuple3 with KryoSerializer?

2015-12-30 Thread Russ
I need to register with KryoSerializer a Tuple3 that is generated by a call to the sortBy() method that eventually calls collect() from Partitioner.RangePartitioner.sketch(). The IntelliJ IDEA debugger indicates that the types for the Tuple3 are java.lang.Integer, java.lang.Integer and long[].  So,

Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-30 Thread Chris Fregly
There are a few different ways to apply approximation algorithms and probabilistic data structures to your Spark data - including Spark's countApproxDistinct() methods as you pointed out. There's also Twitter Algebird, and Redis HyperLogLog (PFCOUNT, PFADD). Here are some examples from my *pipeline
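
A tiny sketch of the countApproxDistinct() call mentioned above (the data and the tolerance are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("approx-sketch"))

// 1,000,000 values, 5,000 of them distinct.
val ids = sc.parallelize(1 to 1000000).map(_ % 5000)

// Exact distinct count versus the HyperLogLog-based approximation;
// the argument is the target relative standard deviation.
println(ids.distinct().count())          // 5000
println(ids.countApproxDistinct(0.01))   // roughly 5000, far cheaper to compute
```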

Re: Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-30 Thread Chris Fregly
@Jim- I'm wondering if those docs are outdated, as it's my understanding (please correct me if I'm wrong) that we should never be seeing OOMs as 1.5/Tungsten not only improved (reduced) the memory footprint of our data, but also introduced better task level - and even key level - external spilling

Re: SparkSQL Hive orc snappy table

2015-12-30 Thread Dawid Wysakowicz
I do understand that Snappy is not splittable as such, but ORCFile is. In ORC, blocks are compressed with snappy, so there should be no problem with it. Anyway, ZLIB (used by default in both ORC and Parquet) is also not splittable but it works perfectly fine. 2015-12-30 16:26 GMT+01:00 Chris Fregly

DStream keyBy

2015-12-30 Thread Brian London
RDD has a method keyBy[K](f: T=>K) that acts as an alias for map(x => (f(x), x)) and is useful for generating pair RDDs. Is there a reason this method doesn't exist on DStream? It's a fairly heavily used method and allows clearer code than the more verbose map.
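
Until such a method exists, the map form from the question works directly on a DStream; a minimal sketch with a hypothetical socket source and key function:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("keyBy-sketch")
val ssc = new StreamingContext(conf, Seconds(1))

// Hypothetical text source, just to have a DStream to work with.
val lines = ssc.socketTextStream("localhost", 9999)

// What a DStream keyBy(f) would be, written out as the map it aliases:
// keyBy(f) == map(x => (f(x), x)).
val byFirstWord = lines.map(line => (line.split(" ").head, line))
byFirstWord.print()

ssc.start()
ssc.awaitTermination()
```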

Using Experimental Spark Features

2015-12-30 Thread David Newberger
Hi All, I've been looking at the Direct Approach for streaming Kafka integration (http://spark.apache.org/docs/latest/streaming-kafka-integration.html) because it looks like a good fit for our use cases. My concern is the feature is experimental according to the documentation. Has anyone used

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Chris Fregly
are the credentials visible from each Worker node to all the Executor JVMs on each Worker? > On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev > wrote: > > Dear Spark community, > > I faced the following issue with trying accessing data on S3a, my code is

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread KOSTIANTYN Kudriavtsev
Chris, good question. As you can see from the code, I set them up on the driver, so I expect they will be propagated to all nodes, won't they? Thank you, Konstantin Kudryavtsev On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly wrote: > are the credentials visible from each Worker

Re: Problem with WINDOW functions?

2015-12-30 Thread Vadim Tkachenko
Davies, Thank you, I will wait on 1.6 release. http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-WINDOW-functions-tt25833.html ? On Wed, Dec 30, 2015 at 12:06 AM, Davies Liu wrote: > Window functions are improved in 1.6 release, could you try 1.6-RC4 >

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Blaž Šnuderl
Try setting s3 credentials using keys specified here https://github.com/Aloisius/hadoop-s3a/blob/master/README.md Blaz On Dec 30, 2015 6:48 PM, "KOSTIANTYN Kudriavtsev" < kudryavtsev.konstan...@gmail.com> wrote: > Dear Spark community, > > I faced the following issue with trying accessing data

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread KOSTIANTYN Kudriavtsev
Hi Blaz, I did, the same result Thank you, Konstantin Kudryavtsev On Wed, Dec 30, 2015 at 12:54 PM, Blaž Šnuderl wrote: > Try setting s3 credentials using keys specified here > https://github.com/Aloisius/hadoop-s3a/blob/master/README.md > > Blaz > On Dec 30, 2015 6:48 PM,

Re: Can't submit job to stand alone cluster

2015-12-30 Thread SparkUser
Sorry, need to clarify. When you say "When the docs say 'If your application is launched through Spark submit, then the application jar is automatically distributed to all worker nodes,' it is actually saying that your executors get their jars from the driver". This is true

Re: Can't submit job to stand alone cluster

2015-12-30 Thread Andrew Or
Hi Jim, Just to clarify further: - *Driver* is the process with SparkContext. A driver represents an application (e.g. spark-shell, SparkPi) so there is exactly one driver in each application. - *Executor* is the process that runs the tasks scheduled by the driver. There should

Re: Problem with WINDOW functions?

2015-12-30 Thread Vadim Tkachenko
Gokula, Thanks, I will try this. I am just SQL kind of guy :), but I will try your suggestion Thanks, Vadim On Wed, Dec 30, 2015 at 1:07 PM, Gokula Krishnan D wrote: > Hello Vadim - > > Alternatively, you can achieve by using the *window functions* which is > available

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Chris Fregly
couple things: 1) switch to IAM roles if at all possible - explicitly passing AWS credentials is a long and lonely road in the end 2) one really bad workaround/hack is to run a job that hits every worker and writes the credentials to the proper location (~/.awscredentials or whatever) ^^ i

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread KOSTIANTYN Kudriavtsev
Chris, thanks for the hint with IAM roles, but in my case I need to run different jobs with different S3 permissions on the same cluster, so this approach doesn't work for me as far as I understand it. Thank you, Konstantin Kudryavtsev On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly

Spark Streaming: Output frequency different from read frequency in StatefulNetworkWordCount

2015-12-30 Thread Soumitra Johri
Hi, in the example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala the streaming frequency is 1 second; however I do not want to print the contents of the word-counts every minute and resent the word counts

Re: How to ignore case in dataframe groupby?

2015-12-30 Thread Eran Witkon
Drop the original column and rename the new column. See df.drop & df.withColumnRenamed. Eran On Wed, 30 Dec 2015 at 19:08 raja kbv wrote: > Solutions from Eran & Yanbo are working well. Thank you. > > @Eran, > > Your solution worked with a small change. >
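
Putting the two replies together, a rough sketch (the countrycode column comes from the thread; the DataFrame df itself is assumed):

```scala
import org.apache.spark.sql.functions.upper

// df is assumed to have a "countrycode" column, as in the thread.
val normalized = df
  .withColumn("upper-code", upper(df("countrycode")))  // add an upper-cased copy
  .drop("countrycode")                                 // drop the original column
  .withColumnRenamed("upper-code", "countrycode")      // rename the copy back

// Grouping on the normalized column is now case-insensitive.
normalized.groupBy("countrycode").count().show()
```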

Re: Problem with WINDOW functions?

2015-12-30 Thread Gokula Krishnan D
Hello Vadim - Alternatively, you can achieve by using the *window functions* which is available from 1.4.0 *code_value.txt (Input)* = 1000,200,Descr-200,01 1000,200,Descr-200-new,02 1000,201,Descr-201,01 1000,202,Descr-202-new,03 1000,202,Descr-202,01
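
A hedged sketch of the window-function approach on data shaped like the sample above - the column names are guessed, the DataFrame df is assumed, and on Spark 1.4/1.5 the function is rowNumber() rather than row_number():

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Hypothetical column names for the sample rows: (id, code, descr, version).
val w = Window.partitionBy("id", "code").orderBy(df("version").desc)

// Keep only the highest-version description per (id, code).
val latest = df
  .withColumn("rn", row_number().over(w))
  .filter("rn = 1")
  .drop("rn")
latest.show()
```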

Re: How to register a Tuple3 with KryoSerializer?

2015-12-30 Thread Shixiong(Ryan) Zhu
You can use "_", e.g., sparkConf.registerKryoClasses(Array(classOf[scala.Tuple3[_, _, _]])) Best Regards, Shixiong(Ryan) Zhu Software Engineer Databricks Inc. shixi...@databricks.com databricks.com On Wed, Dec 30, 2015 at 10:16 AM, Russ

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Jerry Lam
Hi Kostiantyn, Can you define those properties in hdfs-site.xml and make sure it is visible in the class path when you spark-submit? It looks like a conf sourcing issue to me. Cheers, Sent from my iPhone > On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev >

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread KOSTIANTYN Kudriavtsev
Hi Jerry, I want to run different jobs on different S3 buckets - different AWS creds - on the same instances. Could you shed some light on whether it's possible to achieve this with hdfs-site.xml? Thank you, Konstantin Kudryavtsev On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam wrote: > Hi

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Jerry Lam
Hi Kostiantyn, I want to confirm that it works first by using hdfs-site.xml. If yes, you could define different spark-{user-x}.conf and source them during spark-submit. let us know if hdfs-site.xml works first. It should. Best Regards, Jerry Sent from my iPhone > On 30 Dec, 2015, at 2:31

Re: Spark Streaming: Output frequency different from read frequency in StatefulNetworkWordCount

2015-12-30 Thread Shixiong(Ryan) Zhu
You can use "reduceByKeyAndWindow", e.g., val lines = ssc.socketTextStream("localhost", ) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(60), Seconds(60)) wordCounts.print() On Wed, Dec

Re: Using Experimental Spark Features

2015-12-30 Thread Chris Fregly
A lot of folks are using the new Kafka Direct Stream API in production. And a lot of folks who used the old Kafka Receiver-based API are migrating over. The usual downside to "Experimental" features in Spark is that the API might change, so you'll need to rewrite some code. Stability-wise, the

Re: How to ignore case in dataframe groupby?

2015-12-30 Thread raja kbv
Solutions from Eran & Yanbo are working well. Thank you. @Eran, Your solution worked with a small change: DF.withColumn("upper-code", upper(df("countrycode"))). This creates a new column "upper-code". Is there a way to update the column in place or create a new df with the updated column? Thanks, Raja

Working offline with spark-core and sbt

2015-12-30 Thread Ashic Mahtab
Hello, I'm trying to work offline with spark-core. I've got an empty project with the following: name := "sbtSand" version := "1.0" scalaVersion := "2.11.7" libraryDependencies ++= Seq( "joda-time" % "joda-time" % "2.9.1", "org.apache.spark" %% "spark-core" % "1.5.2" ) I can "sbt
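
For readability, the build definition quoted above laid out as a build.sbt (contents exactly as given in the post):

```scala
// build.sbt
name := "sbtSand"

version := "1.0"

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "joda-time"        %  "joda-time"  % "2.9.1",
  "org.apache.spark" %% "spark-core" % "1.5.2"
)
```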

Re: [SparkSQL][Parquet] Read from nested parquet data

2015-12-30 Thread lin
Hi Yanbo, thanks for the quick response. Looks like we'll need to do some workaround. But before that, we'd like to dig into some related discussions first. We've looked through the following URLs, but none seems helpful. Mailing list threads:

RE: Working offline with spark-core and sbt

2015-12-30 Thread Ashic Mahtab
To answer my own question: it appears certain things (like parents, etc.) caused the issue. I was using sbt 0.13.8; using 0.13.9 works fine. From: as...@live.com To: user@spark.apache.org Subject: Working offline with spark-core and sbt Date: Thu, 31 Dec 2015 02:07:26 + Hello, I'm trying

Error while starting Zeppelin Service in HDP2.3.2

2015-12-30 Thread Divya Gehlot
Hi, I am getting following error while starting the Zeppelin service from ambari server . /var/lib/ambari-agent/data/errors-2408.txt Traceback (most recent call last): File "/var/lib/ambari-agent/cache/stacks/HDP/2.3/services/ZEPPELIN/package/scripts/master.py", line 295, in

Error:scalac: Error: assertion failed: List(object package$DebugNode, object package$DebugNode)

2015-12-30 Thread zml张明磊
Hello, Recently, I built Spark from apache/master and am getting the following error. From stackoverflow http://stackoverflow.com/questions/24165184/scalac-assertion-failed-while-run-scalatest-in-idea, I cannot find the Preferences > Scala option he mentions in IntelliJ IDEA. And SBT does not work for me in

Re: [SparkSQL][Parquet] Read from nested parquet data

2015-12-30 Thread Yanbo Liang
This problem has been discussed long before, but I think there is no straightforward way to read only col_g. 2015-12-30 17:48 GMT+08:00 lin : > Hi all, > > We are trying to read from nested parquet data. SQL is "select > col_b.col_d.col_g from some_table" and the data

Re: SparkSQL Hive orc snappy table

2015-12-30 Thread Dawid Wysakowicz
Didn't anyone use Spark with ORC and snappy compression? 2015-12-29 18:25 GMT+01:00 Dawid Wysakowicz : > Hi, > > I have a table in hive stored as orc with compression = snappy. I try to > execute a query on that table that fails (previously I run it on table in >

Re: Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-30 Thread Adrian Bridgett
I've worked around this by setting spark.shuffle.io.connectionTimeout=3600s, uploading the spark tarball to HDFS again and restarting the shuffle service (not 100% sure that last step is needed). I attempted (with my newbie Scala skills) to allow ExternalShuffleClient() to accept an optional
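
A minimal sketch of that workaround as a driver-side setting; the same key can also go in spark-defaults.conf, and whether the external shuffle service needs it as well is the open question in this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-timeout-workaround")
  // Raise the shuffle network timeout so long fetches are not dropped at ~2 min.
  .set("spark.shuffle.io.connectionTimeout", "3600s")

val sc = new SparkContext(conf)
```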

Re: Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-30 Thread Adrian Bridgett
Hi Ted, sorry I should have been a bit more consistent in my cut and paste (there are nine nodes +driver) - I was concentrating on S9/6 (these logs are from that box - 10.1.201.165). S1/4 lines are: 15/12/29 18:49:45 INFO CoarseMesosSchedulerBackend: Registered executor

[SparkSQL][Parquet] Read from nested parquet data

2015-12-30 Thread lin
Hi all, We are trying to read from nested parquet data. SQL is "select col_b.col_d.col_g from some_table" and the data schema for some_table is: root |-- col_a: long (nullable = false) |-- col_b: struct (nullable = true) | |-- col_c: string (nullable = true) | |-- col_d: array

Re: Problem with WINDOW functions?

2015-12-30 Thread Davies Liu
Window functions are improved in 1.6 release, could you try 1.6-RC4 (or wait until next week for the final release)? Even In 1.6, the buffer of rows for window function does not support spilling (also does not use memory efficiently), there is a JIRA for it:

Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-30 Thread Jia Zou
I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 CPU cores and 30GB memory. Executor memory is set to 15GB, and driver memory is set to 15GB. The observation is that, when input data size is smaller than 15GB, the performance is quite stable. However, when input data becomes

Re: Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-30 Thread Adrian Bridgett
To wrap this up, it's the shuffle manager sending the FIN so setting spark.shuffle.io.connectionTimeout to 3600s is the only workaround right now. SPARK-12583 raised. Adrian -- *Adrian Bridgett*

Monitoring Spark HDFS Reads and Writes

2015-12-30 Thread alvarobrandon
Hello: Is there any way of monitoring the number of bytes or blocks read and written by a Spark application? I'm running Spark with YARN and I want to measure how I/O intensive a set of applications are. The closest thing I have seen is the HDFS DataNode logs in YARN, but they don't seem to have
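
One possible application-side approach (not mentioned in the thread) is a SparkListener that sums task-level I/O metrics. This sketch assumes the Spark 1.5.x API, where input/output metrics are Options on TaskMetrics, so verify it against your version:

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class IoBytesListener extends SparkListener {
  val bytesRead = new AtomicLong(0L)
  val bytesWritten = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // taskMetrics can be null for failed tasks, hence the Option wrapper.
    Option(taskEnd.taskMetrics).foreach { m =>
      m.inputMetrics.foreach(im => bytesRead.addAndGet(im.bytesRead))
      m.outputMetrics.foreach(om => bytesWritten.addAndGet(om.bytesWritten))
    }
  }
}

// Usage: register before running the jobs, read the counters afterwards.
// val listener = new IoBytesListener
// sc.addSparkListener(listener)
// ... run the application's jobs ...
// println(s"read=${listener.bytesRead.get} written=${listener.bytesWritten.get}")
```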

Re: Spark job submission REST API

2015-12-30 Thread Fernando O.
One of the advantages of using spark-jobserver is that it lets you reuse your contexts (create one context and run multiple jobs on it). Since you can run multiple jobs in one context, you can also share RDDs (NamedRDD) between jobs, i.e., create an MLlib model and share it without the need to persist it.

Re: SparkSQL Hive orc snappy table

2015-12-30 Thread Chris Fregly
Reminder that Snappy is not a splittable format. I've had success with Hive + LZF (splittable) and bzip2 (also splittable). Gzip is also not splittable, so you won't be utilizing your cluster to process this data in parallel as only 1 task can read and process unsplittable data - versus many