does dstream.transform() run on the driver node?

2015-08-07 Thread lookfwd
Hello, here's a simple program that demonstrates my problem: is keyavg = rdd.values().reduce(sum) / rdd.count() inside stats calculated once per partition, or just once? I guess another way to ask the same question is: is DStream.transform() called on the driver node or not? What would
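
A sketch of the semantics, assuming a Scala DStream[(String, Double)] named pairs: the function passed to transform() is invoked on the driver once per batch, but any RDD actions inside it (count, reduce) still run as distributed jobs on the executors.

    val normalized = pairs.transform { rdd =>
      // this closure runs on the driver, once per batch interval
      val keyavg = rdd.values.reduce(_ + _) / rdd.count()  // two distributed jobs
      rdd.mapValues(_ / keyavg)                            // lazy; executed on the executors
    }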

Re: Spark failed while trying to read parquet files

2015-08-07 Thread Cheng Lian
It doesn't seem to be Parquet 1.7.0 since the package name isn't under org.apache.parquet (1.7.0 is the first official Apache release of Parquet). The version you were using is probably Parquet 1.6.0rc3 according to the line number information:

Re: Spark failed while trying to read parquet files

2015-08-07 Thread Jerrick Hoang
Yes! I was being dumb, should have caught that earlier, thank you Cheng Lian On Fri, Aug 7, 2015 at 4:25 PM, Cheng Lian lian.cs@gmail.com wrote: It doesn't seem to be Parquet 1.7.0 since the package name isn't under org.apache.parquet (1.7.0 is the first official Apache release of

Re: How to get total CPU consumption for Spark job

2015-08-07 Thread gen tang
Hi, the Spark UI and logs don't report the state of the cluster. However, you can use Ganglia to monitor it. In spark-ec2, there is an option to install Ganglia automatically. If you use CDH, you can also use Cloudera Manager. Cheers Gen On Sat, Aug 8, 2015 at 6:06 AM, Xiao

Checkpoint Dir Error in Yarn

2015-08-07 Thread Mohit Anchlia
I am running in yarn-client mode and trying to execute network word count example. When I connect through nc I see the following in spark app logs: Exception in thread main java.lang.AssertionError: assertion failed: The checkpoint directory has not been set. Please use
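
A minimal sketch of what the assertion asks for — set a checkpoint directory on the StreamingContext before using stateful or windowed operations (the HDFS path is a placeholder):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("net-wordcount"), Seconds(1))
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")  // required before stateful operations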

Re: Problems getting expected results from hbase_inputformat.py

2015-08-07 Thread gen tang
Hi, in fact, PySpark uses org.apache.spark.examples.pythonconverters (./examples/src/main/scala/org/apache/spark/pythonconverters/) to transform HBase Result objects into Python strings. Spark updated these two scripts recently; however, they are not included in the official release of Spark. So you

Re: Spark MLib v/s SparkR

2015-08-07 Thread gen tang
Hi, it depends on the problem you are working on. Just as with Python and R, MLlib focuses on machine learning while SparkR will focus on statistics, if SparkR follows the way of R. For instance, if you want to use glm to analyse data: 1. if you are interested only in the parameters of the model, and use this

Re: Checkpoint Dir Error in Yarn

2015-08-07 Thread Tathagata Das
Have you tried doing what it's suggesting? If you want to learn more about checkpointing, you can see the programming guide - http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing For more in-depth understanding, you can see my talk -

Re: [Spark Streaming] Session based windowing like in google dataflow

2015-08-07 Thread Tathagata Das
You can use Spark Streaming's updateStateByKey to do arbitrary sessionization. See the example - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala All it does is store a single number (the count of each word seen
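
The heart of that example, as a sketch (wordDstream is assumed to be a DStream[(String, Int)]; for sessions, the Int state would become a session object carrying timestamps):

    // new batch values for a key plus its previous state -> new state
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0))
    val runningCounts = wordDstream.updateStateByKey[Int](updateFunc)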

Using Spark or Pig, is group by efficient in my use case?

2015-08-07 Thread linlma
I have tens of millions of records, each a customer ID and city ID pair. There are tens of millions of unique customer IDs, and only a few hundred unique city IDs. I want to do a merge to get all city IDs aggregated for a specific customer ID, and pull back all records. I want to do this using group

Spark Maven Build

2015-08-07 Thread Benyi Wang
I'm trying to build spark 1.4.1 against CDH 5.3.2. I created a profile called cdh5.3.2 in spark_parent.pom, made some changes for sql/hive/v0.13.1, and the build finished successfully. Here is my problem: - If I run `mvn -Pcdh5.3.2,yarn,hive install`, the artifacts are installed into my

Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread Saif.A.Ellafi
Hi, a silly question here. The driver web UI dies when the spark-submit program finishes. I would like some time to analyze after the program ends; the page does not refresh itself, and when I hit F5 I lose all the info. Thanks, Saif

Issue when rebroadcasting a variable outside of the definition scope

2015-08-07 Thread simone.robutti
Hello everyone, this is my first message ever to a mailing list so please pardon me if for some reason I'm violating the etiquette. I have a problem with rebroadcasting a variable. How it should work is not well documented, so I could find only a few simple examples to understand how it should

Estimate size of DataFrame programmatically

2015-08-07 Thread Srikanth
Hello, is there a way to estimate the approximate size of a dataframe? I know we can cache and look at the size in the UI, but I'm trying to do this programmatically. With an RDD, I can sample and sum up sizes using SizeEstimator, then extrapolate it to the entire RDD. That will give me the approx size of the RDD.
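
A sketch of that sample-and-extrapolate idea applied to a DataFrame df (note SizeEstimator measures in-memory object size, so this approximates the cached footprint, not the on-disk size):

    import org.apache.spark.util.SizeEstimator

    val sample = df.rdd.takeSample(withReplacement = false, num = 1000)
    val bytesPerRow = sample.map(row => SizeEstimator.estimate(row)).sum.toDouble / sample.length
    val approxTotalBytes = bytesPerRow * df.count()  // extrapolate to the full DataFrame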

Re: log4j custom appender ClassNotFoundException with spark 1.4.1

2015-08-07 Thread mlemay
One possible solution is to spark-submit with --driver-class-path and list all recursive dependencies. This is fragile and error prone. Non-working alternatives (used in SparkSubmit.scala AFTER arguments parser is initialized): spark-submit --packages ... spark-submit --jars ...

Re: log4j custom appender ClassNotFoundException with spark 1.4.1

2015-08-07 Thread mlemay
The offending commit is: [SPARK-6014] [core] Revamp Spark shutdown hooks, fix shutdown races. https://github.com/apache/spark/commit/e72c16e30d85cdc394d318b5551698885cfda9b8

Re: Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread andy petrella
Exactly! The sharing part is used in the Spark Notebook (this one https://github.com/andypetrella/spark-notebook/blob/master/notebooks/Tachyon%20Test.snb) so we can share stuff between notebooks, which are different SparkContexts (in different JVMs). OTOH, we have a project that creates micro services

Re: SparkSQL: add jar blocks all queries

2015-08-07 Thread Wu, James C.
Hi, the issue only seems to happen when trying to access Spark via the Spark SQL Thrift Server interface. Does anyone know a fix? james From: Wu, Walt Disney james.c...@disney.com Date: Friday, August 7, 2015 at 12:40 PM To:

Problems getting expected results from hbase_inputformat.py

2015-08-07 Thread Eric Bless
I’m having some difficulty getting the desired results from the Spark Python example hbase_inputformat.py. I’m running with CDH 5.4, HBase version 1.0.0, Spark v1.3.0, using Python version 2.6.6. I followed the example to create a test HBase table. Here’s the data from the table I created –

Re: spark config

2015-08-07 Thread Ted Yu
In the master branch, build/sbt-launch-lib.bash has the following: URL1=https://dl.bintray.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar I verified that the following exists:

Re: tachyon

2015-08-07 Thread Abhishek R. Singh
Thanks Calvin - much appreciated ! -Abhishek- On Aug 7, 2015, at 11:11 AM, Calvin Jia jia.cal...@gmail.com wrote: Hi Abhishek, Here's a production use case that may interest you: http://www.meetup.com/Tachyon/events/222485713/ Baidu is using Tachyon to manage more than 100 nodes in

Re: spark config

2015-08-07 Thread Dean Wampler
That's the correct URL. Recent change? The last time I looked, earlier this week, it still had the obsolete artifactory URL for URL1 ;) Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler

Re: spark config

2015-08-07 Thread Ted Yu
Looks like Sean fixed it: [SPARK-9633] [BUILD] SBT download locations outdated; need an update Cheers On Fri, Aug 7, 2015 at 3:22 PM, Dean Wampler deanwamp...@gmail.com wrote: That's the correct URL. Recent change? The last time I looked, earlier this week, it still had the obsolete

Accessing S3 files with s3n://

2015-08-07 Thread Akshat Aranya
Hi, I've been trying to track down some problems with Spark reads being very slow with s3n:// URIs (NativeS3FileSystem). After some digging around, I realized that this file system implementation fetches the entire file, which isn't really a Spark problem, but it really slows down things when

How to get total CPU consumption for Spark job

2015-08-07 Thread Xiao JIANG
Hi all, I was running some Hive/Spark jobs on a Hadoop cluster. I want to see how Spark helps improve not only the elapsed time but also the total CPU consumption. For Hive, I can get the 'Total MapReduce CPU Time Spent' from the log when the job finishes. But I didn't find any CPU stats for Spark

Spark failed while trying to read parquet files

2015-08-07 Thread Jerrick Hoang
Hi all, I have a partitioned parquet table (very small table with only 2 partitions). The version of spark is 1.4.1, parquet version is 1.7.0. I applied this patch to spark [SPARK-7743] so I assume that spark can read parquet files normally, however, I'm getting this when trying to do a simple

Re: Spark failed while trying to read parquet files

2015-08-07 Thread Philip Weaver
Yes, NullPointerExceptions are pretty common in Spark (or, rather, I seem to encounter them a lot!) but can occur for a few different reasons. Could you add some more detail, like what the schema is for the data, or the code you're using to read it? On Fri, Aug 7, 2015 at 3:20 PM, Jerrick Hoang

Re: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread François Pelletier
Hi, all Spark applications are saved in the Spark History Server; look at your host on port 18080 instead of 4040. François On 2015-08-07 15:26, saif.a.ell...@wellsfargo.com wrote: Hi, A silly question here. The Driver Web UI dies when the spark-submit program finish. I would like some time

RE: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread Saif.A.Ellafi
Hello, thank you, but that port is unreachable for me. Can you please share where I can find that port's equivalent in my environment? Thank you, Saif From: François Pelletier [mailto:newslett...@francoispelletier.org] Sent: Friday, August 07, 2015 4:38 PM To: user@spark.apache.org Subject: Re:

Re: Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Hien, is Azkaban being phased out at LinkedIn as rumored? If so, what's LinkedIn going to use for workflow scheduling? Is there something else that's going to replace Azkaban? On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu yuzhih...@gmail.com wrote: In my opinion, choosing some particular project

Re: All masters are unresponsive! Giving up.

2015-08-07 Thread Sonal Goyal
There seems to be a version mismatch somewhere. You can try to find the cause with debug serialization information. I think the JVM flag -Dsun.io.serialization.extendedDebugInfo=true should help. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co Check out Reifier at

miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Gerald Loeffler
hi, if new LinearRegressionWithSGD() uses a miniBatchFraction of 1.0, doesn’t that make it deterministic/classical gradient descent rather than SGD? Specifically, miniBatchFraction=1.0 means the entire data set, i.e. all rows. In the spirit of SGD, shouldn’t the default be the fraction that
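
For reference, the train() overload where that fraction is set — a sketch assuming an RDD[LabeledPoint] named trainingData:

    import org.apache.spark.mllib.regression.LinearRegressionWithSGD

    // miniBatchFraction = 1.0 (the default) samples all rows on every iteration,
    // i.e. classical batch gradient descent; values < 1.0 give stochastic behaviour
    val model = LinearRegressionWithSGD.train(trainingData, 100, 0.01, 0.1)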

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-07 Thread Cheng Lian
However, it's weird that the partition discovery job only spawns 2 tasks. It should use the default parallelism, which is probably 8 according to the logs of the next Parquet reading job. Partition discovery is already done in a distributed manner via a Spark job. But the parallelism is

JavaSparkContext causes hadoop.ipc.RemoteException error

2015-08-07 Thread junliu6
Hi, I'm a new Spark user. Recently, I met a weird error in our cluster. I deployed spark-1.3.1 and CDH5 on my cluster; weeks ago, I deployed NameNode HA on it. After that, my Spark job hit an error when I used the Java API, like this:

Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-07 Thread canan chen
Is there any reason that the history server uses another property for the event log dir? Thanks

DataFrame column structure change

2015-08-07 Thread Rishabh Bhardwaj
Hi all, I want to have some nesting structure from the existing columns of the dataframe. For that, I am trying to transform a DF in the following way, but couldn't do it. scala> df.printSchema root |-- a: string (nullable = true) |-- b: string (nullable = true) |-- c: string (nullable = true)

StringIndexer + VectorAssembler equivalent to HashingTF?

2015-08-07 Thread praveen S
Is StringIndexer + VectorAssembler equivalent to HashingTF while converting the document for analysis?

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-07 Thread Cheng Lian
Hi Philip, thanks for providing the log file. It seems that most of the time is spent on partition discovery. The code snippet you provided actually issues two jobs. The first one is for listing the input directories to find out all leaf directories (and this actually requires listing all

Spark on YARN

2015-08-07 Thread Jem Tucker
Hi, I am running Spark on YARN on the CDH 5.3.2 stack. I have created a new user to own and run a testing environment; however, when using this user, applications I submit to YARN never begin to run, even if they are the exact same applications that succeed for another user. Has anyone seen

Re: SparkR Supported Types - Please add bigint

2015-08-07 Thread Davies Liu
They are actually the same thing, LongType. `long` is friendly for developers, `bigint` is friendly for database folks, maybe data scientists. On Thu, Jul 23, 2015 at 11:33 PM, Sun, Rui rui@intel.com wrote: printSchema calls StructField.buildFormattedString() to output schema information.

How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Hao Ren
Is there any workaround to distribute a non-serializable object for an RDD transformation or as a broadcast variable? Say I have an object of class C which is not serializable. Class C is in a jar package; I have no control over it. Now I need to distribute it either by RDD transformation or by broadcast.

RE: Specifying the role when launching an AWS spark cluster using spark_ec2

2015-08-07 Thread Ewan Leith
You'll have a lot less hassle using the AWS EMR instances with Spark 1.4.1 for now, until the spark_ec2.py scripts move to Hadoop 2.7.1; at the moment I'm pretty sure it's only using Hadoop 2.4. The EMR setup with Spark lets you use s3:// URIs with IAM roles. Ewan -Original Message-

Re: How to binarize data in spark

2015-08-07 Thread Adamantios Corais
I have ended up with the following piece of code, but it turns out to be really slow... Any other ideas, provided that I can only use MLlib 1.2? val data = test11.map(x => ((x(0), x(1)), x(2))).groupByKey().map(x => (x._1, x._2.toArray)).map { x => var lt: Array[Double] = new

Spark streaming and session windows

2015-08-07 Thread Ankur Chauhan
Hi all, I am trying to figure out how to perform equivalent of Session windows (as mentioned in https://cloud.google.com/dataflow/model/windowing) using spark streaming. Is it even possible (i.e. possible to do efficiently at scale). Just to expand on the definition: Taken from the google

Re: StringIndexer + VectorAssembler equivalent to HashingTF?

2015-08-07 Thread Peter Rudenko
No, here's an example:

COL1  COL2
a     one
b     two
a     two
c     three

StringIndexer.setInputCol("COL1").setOutputCol("SI1") -> (0 -> a, 1 -> b, 2 -> c)

SI1: 0, 1, 0, 2

StringIndexer.setInputCol("COL2").setOutputCol("SI2") -> (0 -> one, 1 -> two, 2 -> three)

SI2: 0, 1, 1, 2
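
A sketch of wiring those indexers into a feature vector with VectorAssembler (spark.ml in Spark 1.4; df is assumed to be a DataFrame with string columns COL1 and COL2):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    val si1 = new StringIndexer().setInputCol("COL1").setOutputCol("SI1")
    val si2 = new StringIndexer().setInputCol("COL2").setOutputCol("SI2")
    val assembler = new VectorAssembler()
      .setInputCols(Array("SI1", "SI2")).setOutputCol("features")
    val features = new Pipeline().setStages(Array(si1, si2, assembler))
      .fit(df).transform(df)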

SparkR -Graphx Connected components

2015-08-07 Thread smagadi
Hi, I was trying to use stronglyConnectedComponents(). Given a DAG as the graph, I was supposed to get back the list of strongly connected components. def main(args: Array[String]) { val vertexArray = Array( (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)), (4L, ("David", 42)), (5L, ("Ed",

Re: DataFrame column structure change

2015-08-07 Thread Rishabh Bhardwaj
I am doing it by creating a new data frame out of the fields to be nested and then joining with the original DF. Looking for a more optimized solution here. On Fri, Aug 7, 2015 at 2:06 PM, Rishabh Bhardwaj rbnex...@gmail.com wrote: Hi all, I want to have some nesting structure from the existing
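
A sketch that may avoid the join entirely, assuming Spark 1.4's struct() function and the a/b/c schema from the original post:

    import org.apache.spark.sql.functions.struct

    val nested = df.select(struct(df("a"), df("b")).as("ab"), df("c"))
    nested.printSchema()
    // root
    //  |-- ab: struct
    //  |    |-- a: string
    //  |    |-- b: string
    //  |-- c: string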

automatically determine cluster number

2015-08-07 Thread Ziqi Zhang
Hi, I am new to Spark and I need to use the clustering functionality to process a large dataset. There are between 50k and 1 million objects to cluster. However, the problem is that the optimal number of clusters is unknown. We cannot even estimate a range, except that we know there are N objects.

Insert operation in Dataframe

2015-08-07 Thread guoqing0...@yahoo.com.hk
Hi all, does the DataFrame support the insert operation, like sqlContext.sql("insert into table1 xxx select xxx from table2")? guoqing0...@yahoo.com.hk
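
For the programmatic route, DataFrameWriter has insertInto (Spark 1.4) — a sketch; the table and column names are placeholders from the question, and the selected schema must match the target table:

    sqlContext.table("table2").select("xxx").write.insertInto("table1")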

Re: How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Philip Weaver
If the object cannot be serialized, then I don't think broadcast will make it magically serializable. You can't transfer data structures between nodes without serializing them somehow. On Fri, Aug 7, 2015 at 7:31 AM, Sujit Pal sujitatgt...@gmail.com wrote: Hi Hao, I think sc.broadcast will

Re: How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Han JU
If the object is something like a utility object (say a DB connection handler), I often use: @transient lazy val someObj = MyFactory.getObj(...) So basically `@transient` tells the closure cleaner not to serialize this, and the `lazy val` allows it to be initialized on each executor upon its
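
A fuller sketch of that pattern (MyFactory and the JDBC connection are stand-ins for the real non-serializable class):

    object MyFactory {
      def getObj(url: String): java.sql.Connection =
        java.sql.DriverManager.getConnection(url)  // non-serializable resource
    }

    class Enricher(url: String) extends Serializable {
      // excluded from serialization; re-created lazily on each executor on first use
      @transient lazy val conn = MyFactory.getObj(url)
    }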

Re: Newbie question: what makes Spark run faster than MapReduce

2015-08-07 Thread Hien Luu
This blog outlines a few things that make Spark faster than MapReduce - https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html On Fri, Aug 7, 2015 at 9:13 AM, Muler mulugeta.abe...@gmail.com wrote: Consider the classic word count application over a 4 node cluster with a sizable

Re: log4j.xml bundled in jar vs log4.properties in spark/conf

2015-08-07 Thread mlemay
See this post for a detailed explanation of your problem: http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-4-1-tt24159.html

distributing large matrices

2015-08-07 Thread iceback
Is this the sort of problem Spark can accommodate? I need to compare 10,000 matrices with each other (10^10 comparisons). The matrices are 100x10 (10^7 int values). I have 10 machines with 2 to 8 cores (8-32 processors). All machines have to - contribute to matrix generation (a

Re: Newbie question: what makes Spark run faster than MapReduce

2015-08-07 Thread Corey Nolet
1) Spark only needs to shuffle when data needs to be partitioned around the workers in an all-to-all fashion. 2) Multi-stage jobs that would normally require several MapReduce jobs, thus causing data to be dumped to disk between the jobs, can keep their data cached in memory.
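
A sketch of point 2 — reusing one dataset across two jobs without re-reading from disk (the HDFS path is a placeholder):

    val words = sc.textFile("hdfs:///data/input").flatMap(_.split(" ")).cache()
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    counts.count()                                             // first action materializes the cache
    val totalChars = words.map(_.length.toLong).reduce(_ + _)  // second job reads from memory, not HDFS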

Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Hi, I'm looking for open source workflow tools/engines that allow us to schedule Spark jobs on a DataStax Cassandra cluster. Since there are tonnes of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc., I wanted to check with people here to see what they are using today. Some of the

Re: How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Eugene Morozov
Hao, I’d say there are a few possible ways to achieve that: 1. Use KryoSerializer. The flaw of KryoSerializer is that the current version (2.21) has an issue with internal state and it might not work for some objects. Spark gets its Kryo dependency transitively through chill, and it will not be resolved
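
A sketch of option 1 — switching the serializer and registering the class (Record is a hypothetical stand-in for class C):

    import org.apache.spark.SparkConf

    case class Record(id: Long, payload: Array[Byte])  // stand-in for the class to serialize
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Record]))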

Re: Spark job workflow engine recommendations

2015-08-07 Thread Hien Luu
Looks like Oozie can satisfy most of your requirements. On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone vikramk...@gmail.com wrote: Hi, I'm looking for open source workflow tools/engines that allow us to schedule spark jobs on a datastax cassandra cluster. Since there are tonnes of

Newbie question: what makes Spark run faster than MapReduce

2015-08-07 Thread Muler
Consider the classic word count application over a 4-node cluster with a sizable working data set. What makes Spark run faster than MapReduce, considering that Spark also has to write to disk during shuffle?

Re: Amazon DynamoDB Spark

2015-08-07 Thread Jay Vyas
In general the simplest way is to use the DynamoDB Java API as is and call it inside a map(), using the asynchronous put() DynamoDB API call. On Aug 7, 2015, at 9:08 AM, Yasemin Kaya godo...@gmail.com wrote: Hi, Is there a way using DynamoDB in spark application? I have to
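
A variant of that idea as a sketch, using foreachPartition with the synchronous client from the AWS SDK v1 rather than the async client Jay mentions (the "results" table name and the (id, value) pair RDD are assumptions):

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
    import com.amazonaws.services.dynamodbv2.model.AttributeValue
    import scala.collection.JavaConverters._

    results.foreachPartition { rows =>
      val client = new AmazonDynamoDBClient()  // built per partition; the client is not serializable
      rows.foreach { case (id, value) =>
        val item = Map("id" -> new AttributeValue(id),
                       "value" -> new AttributeValue(value)).asJava
        client.putItem("results", item)        // hypothetical table name
      }
    }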

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-07 Thread Philip Weaver
Thanks, I also confirmed that the partition discovery is slow by writing a non-Spark application that uses the parquet library directly to load the partitions. It's so slow that my colleague's Python application can read the entire contents of all the parquet data files faster than my

Re: Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Thanks for the suggestion, Hien. I'm curious why not Azkaban from LinkedIn. From what I read online, Oozie was very cumbersome to set up and use compared to Azkaban. Since you are from LinkedIn, I wanted to get some perspective on what it lacks compared to Oozie. Ease of use is very important, more than

RE: Issue when rebroadcasting a variable outside of the definition scope

2015-08-07 Thread Ganelin, Ilya
Simone, here are some thoughts. Please check out the understanding closures section of the Spark Programming Guide. Secondly, broadcast variables do not propagate updates to the underlying data. You must either create a new broadcast variable or, alternatively, if you simply wish to accumulate
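
A sketch of the "create a new broadcast variable" approach (loadLookupTable() is hypothetical and returns a Map[String, String]):

    var lookup = sc.broadcast(loadLookupTable())
    def refresh(): Unit = {
      lookup.unpersist(blocking = true)          // drop stale copies on the executors
      lookup = sc.broadcast(loadLookupTable())   // a brand-new broadcast with fresh data
    }
    rdd.map(x => lookup.value.getOrElse(x, x))   // read .value inside the closure, at call time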

Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Feynman Liang
Sounds reasonable to me, feel free to create a JIRA (and PR if you're up for it) so we can see what others think! On Fri, Aug 7, 2015 at 1:45 AM, Gerald Loeffler gerald.loeff...@googlemail.com wrote: hi, if new LinearRegressionWithSGD() uses a miniBatchFraction of 1.0, doesn’t that make it

Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Feynman Liang
Yep, I think that's what Gerald is saying and they are proposing to default miniBatchFraction = (1 / numInstances). Is that correct? On Fri, Aug 7, 2015 at 11:16 AM, Meihua Wu rotationsymmetr...@gmail.com wrote: I think in the SGD algorithm, the mini batch sample is done without replacement.

Spark SQL query AVRO file

2015-08-07 Thread java8964
Hi, Spark users: We are currently using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our production cluster, which has 42 data/task nodes. There is one dataset stored as Avro files, about 3 TB. Our business has a complex query running over the dataset, which is stored in a nested structure with Arrays of

Re: Spark MLib v/s SparkR

2015-08-07 Thread Feynman Liang
SparkR and MLlib are becoming more integrated (we recently added R formula support) but the integration is still quite small. If you learn R and SparkR, you will not be able to leverage most of the distributed algorithms in MLlib (e.g. all the algorithms you cited). However, you could use the

Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread Muler
Spark is an in-memory engine and attempts to do computation in memory. Tachyon is memory-centric distributed storage, OK, but how would that help run Spark faster?

How to run start-thrift-server in debug mode?

2015-08-07 Thread Benjamin Ross
Hi, I'm trying to run the Hive thrift server in debug mode. I've tried to simply pass -Xdebug -Xrunjdwp:transport=dt_socket,address=127.0.0.1:,server=y,suspend=n to start-thriftserver.sh as a driver option, but it doesn't seem to open a debug server port. I've then tried to edit the various shell

Re: All masters are unresponsive! Giving up.

2015-08-07 Thread Ted Yu
Spark 1.4.1 depends on: <akka.version>2.3.4-spark</akka.version> Is it possible that your standalone cluster has another version of Akka? Cheers On Fri, Aug 7, 2015 at 10:48 AM, Jeff Jones jjo...@adaptivebiotech.com wrote: Thanks. Added this to both the client and the master but still not

Re: All masters are unresponsive! Giving up.

2015-08-07 Thread Igor Berman
Check on which IP/port the master listens: netstat -a -t --numeric-ports On 7 August 2015 at 20:48, Jeff Jones jjo...@adaptivebiotech.com wrote: Thanks. Added this to both the client and the master but still not getting any more information. I confirmed the flag with ps. jjones53222 2.7

Get bucket details created in shuffle phase

2015-08-07 Thread cheez
Hey all. I was trying to understand Spark internals by looking into (and hacking) the code. I was trying to explore the buckets which are generated when we partition the output of each map task and then let the reduce side fetch them on the basis of partitionId. I went into the write() method of

RE: Spark SQL query AVRO file

2015-08-07 Thread java8964
Hi, Michael: I am not sure how spark-avro can help in this case. My understanding is that to use spark-avro, I have to translate all the logic from this big Hive query into Spark code, right? If I have this big Hive query, how can I use spark-avro to run the query? Thanks Yong From:

Re: tachyon

2015-08-07 Thread Ted Yu
Looks like you would get better response on Tachyon's mailing list: https://groups.google.com/forum/?fromgroups#!forum/tachyon-users Cheers On Fri, Aug 7, 2015 at 9:56 AM, Abhishek R. Singh abhis...@tetrationanalytics.com wrote: Do people use Tachyon in production, or is it experimental

[Spark Streaming] Session based windowing like in google dataflow

2015-08-07 Thread Ankur Chauhan
Hi all, I am trying to figure out how to perform equivalent of Session windows (as mentioned in https://cloud.google.com/dataflow/model/windowing) using spark streaming. Is it even possible (i.e. possible to do efficiently at scale). Just to expand on the definition: Taken from the google

Fwd: [Spark + Hive + EMR + S3] Issue when reading from Hive external table backed on S3 with large amount of small files

2015-08-07 Thread Roberto Coluccio
Please community, I'd really appreciate your opinion on this topic. Best regards, Roberto -- Forwarded message -- From: Roberto Coluccio roberto.coluc...@gmail.com Date: Sat, Jul 25, 2015 at 6:28 PM Subject: [Spark + Hive + EMR + S3] Issue when reading from Hive external table

Re: Spark job workflow engine recommendations

2015-08-07 Thread Jörn Franke
Also check Falcon in combination with Oozie. On Fri, Aug 7, 2015 at 17:51, Hien Luu h...@linkedin.com.invalid wrote: Looks like Oozie can satisfy most of your requirements. On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone vikramk...@gmail.com wrote: Hi, I'm looking for open source workflow

SparkSQL: remove jar added by add jar command from dependencies

2015-08-07 Thread Wu, James C.
Hi, I am using Spark SQL to run some queries on a set of Avro data. Somehow I am getting this error: 0: jdbc:hive2://n7-z01-0a2a1453 select count(*) from flume_test; Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 26.0 failed 4 times, most recent

tachyon

2015-08-07 Thread Abhishek R. Singh
Do people use Tachyon in production, or is it experimental grade still? Regards, Abhishek

Re: Spark job workflow engine recommendations

2015-08-07 Thread Ted Yu
From what I heard (from an ex-coworker who is an Oozie committer), Azkaban is being phased out at LinkedIn because of scalability issues (though UI-wise, Azkaban seems better). Vikram: I suggest you do more research in related projects (maybe using their mailing lists). Disclaimer: I don't work for

Re: Spark SQL query AVRO file

2015-08-07 Thread Michael Armbrust
You can register your data as a table using this library and then query it using HiveQL: CREATE TEMPORARY TABLE episodes USING com.databricks.spark.avro OPTIONS (path "src/test/resources/episodes.avro") On Fri, Aug 7, 2015 at 11:42 AM, java8964 java8...@hotmail.com wrote: Hi, Michael: I am not
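
A sketch of running the existing HiveQL against the registered table from Scala (the title column is an assumption about the Avro schema):

    sqlContext.sql("""CREATE TEMPORARY TABLE episodes
                      USING com.databricks.spark.avro
                      OPTIONS (path "src/test/resources/episodes.avro")""")
    val df = sqlContext.sql("SELECT title, COUNT(*) AS n FROM episodes GROUP BY title")
    df.show()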

RE: All masters are unresponsive! Giving up.

2015-08-07 Thread Jeff Jones
Thanks. Added this to both the client and the master but still not getting any more information. I confirmed the flag with ps. jjones53222 2.7 0.1 19399412 549656 pts/3 Sl 17:17 0:44 /opt/jdk1.8/bin/java -cp

Re: Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread Calvin Jia
Hi, Tachyon http://tachyon-project.org manages memory off heap which can help prevent long GC pauses. Also, using Tachyon will allow the data to be shared between Spark jobs if they use the same dataset. Here's http://www.meetup.com/Tachyon/events/222485713/ a production use case where Baidu

Re: Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Oh ok. That's a good enough reason against Azkaban then. So it looks like Oozie is the best choice here. On Friday, August 7, 2015, Ted Yu yuzhih...@gmail.com wrote: From what I heard (an ex-coworker who is Oozie committer), Azkaban is being phased out at LinkedIn because of scalability issues

Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Meihua Wu
I think in the SGD algorithm, the mini batch sample is done without replacement. So with fraction=1, then all the rows will be sampled exactly once to form the miniBatch, resulting to the deterministic/classical case. On Fri, Aug 7, 2015 at 9:05 AM, Feynman Liang fli...@databricks.com wrote:

Re: Amazon DynamoDB Spark

2015-08-07 Thread Yasemin Kaya
Thanx Jay. 2015-08-07 19:25 GMT+03:00 Jay Vyas jayunit100.apa...@gmail.com: In general the simplest way is that you can use the Dynamo Java API as is and call it inside a map(), and use the asynchronous put() Dynamo api call . On Aug 7, 2015, at 9:08 AM, Yasemin Kaya godo...@gmail.com

RE: Spark SQL query AVRO file

2015-08-07 Thread java8964
Good to know that. Let me research it and give it a try. Thanks Yong From: mich...@databricks.com Date: Fri, 7 Aug 2015 11:44:48 -0700 Subject: Re: Spark SQL query AVRO file To: java8...@hotmail.com CC: user@spark.apache.org You can register your data as a table using this library and then query

RE: distributing large matrices

2015-08-07 Thread Koen Vantomme
Sent from my Sony Xperia™ smartphone. iceback wrote: Is this the sort of problem spark can accommodate? I need to compare 10,000 matrices with each other (10^10 comparison). The matrices are 100x10 (10^7 int values). I have 10 machines with 2 to 8 cores (8-32

RE: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread Koen Vantomme
Sent from my Sony Xperia™ smartphone. saif.a.ell...@wellsfargo.com wrote:

Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Feynman Liang
Good point; I agree that defaulting to online SGD (single example per iteration) would be a poor UX due to performance. On Fri, Aug 7, 2015 at 12:44 PM, Meihua Wu rotationsymmetr...@gmail.com wrote: Feynman, thanks for clarifying. If we default miniBatchFraction = (1 / numInstances), then we

SparkSQL: add jar blocks all queries

2015-08-07 Thread Wu, James C.
Hi, I got into a situation where a prior add jar command caused Spark SQL to stop working for all users. Does anyone know how to fix the issue? Regards, james From: Wu, Walt Disney james.c...@disney.com Date: Friday, August 7, 2015 at 10:29 AM To:

Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Meihua Wu
Feynman, thanks for clarifying. If we default miniBatchFraction = (1 / numInstances), then we will only hit one row for every iteration of SGD, regardless of the number of partitions and executors. In other words, the parallelism provided by the RDD is lost in this approach. I think this is something

Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Koen Vantomme
Sent from my Sony Xperia™ smartphone. Meihua Wu wrote: Feynman, thanks for clarifying. If we default miniBatchFraction = (1 / numInstances), then we will only hit one row for every iteration of SGD regardless of the number of

Re: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread François Pelletier
Look at spark.history.ui.port if you use standalone, or spark.yarn.historyServer.address if you use YARN, in your Spark config file. Mine is located at /etc/spark/conf/spark-defaults.conf. If you use Apache Ambari, you can find these settings in the Spark / Configs / Advanced spark-defaults tab.
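
For the original question (keeping the UI after the job finishes), the application also has to write event logs that the history server can replay — a minimal spark-defaults.conf sketch (the HDFS path is an assumption):

    spark.eventLog.enabled           true
    spark.eventLog.dir               hdfs:///user/spark/applicationHistory
    spark.history.fs.logDirectory    hdfs:///user/spark/applicationHistory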

Fwd: spark config

2015-08-07 Thread Bryce Lobdell
I recently downloaded Spark package 1.4.0. A build of Spark with sbt/sbt clean assembly failed with the message Error: Invalid or corrupt jarfile build/sbt-launch-0.13.7.jar. Upon investigation I figured out that sbt-launch-0.13.7.jar is downloaded at build time and that it contained the

Re: Time series forecasting

2015-08-07 Thread ploffay
I'm interested in machine learning on time series. In our environment we have a lot of metric data continuously coming in from agents. The data are stored in Cassandra. Is it possible to set up Spark to apply machine learning to previous data and new incoming data?

Amazon DynamoDB Spark

2015-08-07 Thread Yasemin Kaya
Hi, is there a way to use DynamoDB in a Spark application? I have to persist my results to DynamoDB. Thanx, yasemin -- hiç ender hiç

Re: SparkR -Graphx Connected components

2015-08-07 Thread Robineast
Hi, the graph returned by SCC (strong_graphs in your code) has vertex data where each vertex in a component is assigned the lowest vertex id of the component. So if you have 6 vertices (1 to 6) and 2 strongly connected components (1 and 3, and 2, 4, 5 and 6), then the strongly connected components
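
A sketch of that labelling on the 6-vertex example (the edge set is one possible graph with those two components):

    import org.apache.spark.graphx._

    val edges = sc.parallelize(Array(
      Edge(1L, 3L, 1), Edge(3L, 1L, 1),                    // component {1, 3}
      Edge(2L, 4L, 1), Edge(4L, 5L, 1), Edge(5L, 6L, 1), Edge(6L, 2L, 1)))  // component {2, 4, 5, 6}
    val graph = Graph.fromEdges(edges, defaultValue = 0)
    val scc = graph.stronglyConnectedComponents(numIter = 10)
    // each vertex is labelled with the smallest vertex id in its component,
    // e.g. (3,1) means vertex 3 belongs to the component labelled 1
    scc.vertices.collect().foreach(println)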

Issues with Phoenix 4.5

2015-08-07 Thread Nicola Ferraro
Hi all, I am getting an exception when trying to execute a Spark job that uses the new Phoenix 4.5 Spark connector. The application works very well on my local machine, but fails to run in a cluster environment on top of YARN. The cluster is a Cloudera CDH 5.4.4 with HBase 1.0.0 and Phoenix

Re: log4j custom appender ClassNotFoundException with spark 1.4.1

2015-08-07 Thread mlemay
Looking at the call stack and the diffs between 1.3.1 and 1.4.1-rc4, I see something that could be relevant to the issue. 1) The call stack shows that the log4j manager gets initialized using the default Java context class loader. This context class loader should probably be MutableURLClassLoader from Spark, but
