Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
Hm, now I wonder if it's the same issue here: https://issues.apache.org/jira/browse/SPARK-10149 Does the setting described there help? On Mon, Oct 26, 2015 at 11:39 AM, Jinfeng Li wrote: > Hi, I have already tried the same code with Spark 1.3.1, there is no such > problem.

Re: [Spark Streaming] How do we reset the updateStateByKey values.

2015-10-26 Thread Adrian Tanase
Have you considered union-ing the 2 streams? Basically you can consider them as 2 “message types” that your update function can consume (e.g. implement a common interface): * regularUpdate * resetStateUpdate Inside your updateStateByKey you can check if any of the messages in the list
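A minimal Scala sketch of this idea (the message types, stream names and state type are illustrative, not from the thread):

    sealed trait SessionMsg
    case class RegularUpdate(count: Long) extends SessionMsg
    case object ResetStateUpdate extends SessionMsg

    // Assume regularStream and resetStream are both DStream[(String, SessionMsg)],
    // and that ssc.checkpoint(...) has been set (required for stateful DStreams).
    val merged = regularStream.union(resetStream)

    val state = merged.updateStateByKey[Long] { (msgs: Seq[SessionMsg], current: Option[Long]) =>
      if (msgs.contains(ResetStateUpdate))
        Some(0L)  // a reset message clears the accumulated state for this key
      else
        Some(current.getOrElse(0L) + msgs.collect { case RegularUpdate(c) => c }.sum)
    }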

Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-26 Thread Zhiliang Zhu
Hi Meihua, DB Tsai, Thanks very much for all your kind help. After I added some more LabeledPoint examples to the training data, the output result also looks much better. I will also try the setFitIntercept(false) approach. Currently I have encountered a problem with an algorithm optimization issue: f(x1, x2,

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
-dev +user How are you measuring network traffic? It's not in general true that there will be zero network traffic, since not all executors are local to all data. That can be the situation in many cases but not always. On Mon, Oct 26, 2015 at 8:57 AM, Jinfeng Li wrote: > Hi,

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
Hm, how about the opposite question -- do you have just 1 executor? then again everything will be remote except for a small fraction of blocks. On Mon, Oct 26, 2015 at 9:28 AM, Jinfeng Li wrote: > Replication factor is 3 and we have 18 data nodes. We check HDFS webUI, > data

Re: Secondary Sorting in Spark

2015-10-26 Thread Adrian Tanase
Do you have a particular concern? You’re always using a partitioner (default is HashPartitioner) and the Partitioner interface is pretty light, can’t see how it could affect performance. Used correctly it should improve performance as you can better control placement of data and avoid
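A small Scala sketch of the "control placement of data" point (the RDD names and partition count are illustrative): co-partitioning two pair RDDs with the same partitioner lets the later join avoid an extra shuffle.

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(64)
    val left  = leftRdd.partitionBy(part).cache()   // leftRdd, rightRdd: RDD[(K, V)] assumed to exist
    val right = rightRdd.partitionBy(part).cache()
    val joined = left.join(right)                   // both sides already co-partitioned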

Re: [SPARK STREAMING] Concurrent operations in spark streaming

2015-10-26 Thread Adrian Tanase
If I understand the order correctly, not really. First of all, the easiest way to make sure it works as expected is to check out the visual DAG in the spark UI. It should map 1:1 to your code, and since I don’t see any shuffles in the operations below it should execute all in one stage,

Re: Accumulators internals and reliability

2015-10-26 Thread Adrian Tanase
I can reply from a user’s perspective – I defer the question of semantic guarantees to someone with more experience. I’ve successfully implemented the following using a custom Accumulable class: * Created a MapAccumulator with dynamic keys (they are driven by the data coming in), as opposed to
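A rough sketch of such a map-valued accumulator using the Spark 1.x AccumulableParam API (the key/value types and merge rule are illustrative):

    import org.apache.spark.AccumulableParam

    object MapAccumulatorParam extends AccumulableParam[Map[String, Long], (String, Long)] {
      // Add one (key, count) pair produced on an executor.
      def addAccumulator(acc: Map[String, Long], kv: (String, Long)): Map[String, Long] =
        acc + (kv._1 -> (acc.getOrElse(kv._1, 0L) + kv._2))
      // Merge partial maps coming from different tasks.
      def addInPlace(m1: Map[String, Long], m2: Map[String, Long]): Map[String, Long] =
        m2.foldLeft(m1) { case (m, (k, v)) => m + (k -> (m.getOrElse(k, 0L) + v)) }
      def zero(initial: Map[String, Long]): Map[String, Long] = Map.empty
    }

    // Usage, assuming an existing SparkContext sc and an RDD of keys:
    // val mapAcc = sc.accumulable(Map.empty[String, Long])(MapAccumulatorParam)
    // keysRdd.foreach(k => mapAcc += (k -> 1L))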

Re: [SPARK STREAMING] Concurrent operations in spark streaming

2015-10-26 Thread Adrian Tanase
Thinking more about it – it should only be 2 tasks as A and B are most likely collapsed by spark in a single task. Again – learn to use the spark UI as it’s really informative. The combination of DAG visualization and task count should answer most of your questions. -adrian From: Adrian

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
I use standalone mode. Each machine has 4 workers. Spark is deployed correctly, as the webUI and the jps command show. Actually, we are a team and have been using Spark for nearly half a year, starting from Spark 1.3.1. We found this problem in one of our applications and I wrote a simple program to

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
Yeah, are these stats actually reflecting data read locally, like through the loopback interface? I'm also no expert on the internals here but this may be measuring effectively local reads. Or are you sure it's not? On Mon, Oct 26, 2015 at 11:14 AM, Steve Loughran wrote:

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Steve Loughran
> On 26 Oct 2015, at 09:28, Jinfeng Li wrote: > > Replication factor is 3 and we have 18 data nodes. We check HDFS webUI, data > is evenly distributed among 18 machines. > every block in HDFS (usually 64-128-256 MB) is distributed across three machines, meaning 3

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
Replication factor is 3 and we have 18 data nodes. We check HDFS webUI, data is evenly distributed among 18 machines. On Mon, Oct 26, 2015 at 5:18 PM Sean Owen wrote: > Have a look at your HDFS replication, and where the blocks are for these > files. For example, if you had

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
Hi, I have already tried the same code with Spark 1.3.1, there is no such problem. The configuration files are all directly copied from Spark 1.5.1. I feel it is a bug on Spark 1.5.1. Thanks a lot for your response. On Mon, Oct 26, 2015 at 7:21 PM Sean Owen wrote: > Yeah,

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Fengdong Yu
How many partitions did you generate? If millions were generated, then a huge amount of memory is consumed. > On Oct 26, 2015, at 10:58 AM, Jerry Lam wrote: > > Hi guys, > > I mentioned that the partitions are generated so I tried to read the > partition data from it. The driver

Re: [Yarn-Client]Can not access SparkUI

2015-10-26 Thread Deng Ching-Mallete
Hi Earthson, Unfortunately, attachments aren't allowed in the list so they seemed to have been removed from your email. Anyway, what happens when you click the ApplicationMaster link? Thanks, Deng On Mon, Oct 26, 2015 at 2:21 PM, Earthson wrote: > We are using Spark

Re: [Yarn-Client]Can not access SparkUI

2015-10-26 Thread Earthson Lu
it blocks until 500 Error: HTTP ERROR 500 Problem accessing /proxy/application_1443146594954_0057/. Reason: (TimeOut, I guess) Caused by: java.net.ConnectException: at java.net.PlainSocketImpl.socketConnect(Native Method) at

Accumulators internals and reliability

2015-10-26 Thread Sela, Amit
It seems like there is not much literature about Spark's Accumulators so I thought I'd ask here: Do Accumulators reside in a Task? Are they serialized with the task? Sent back on task completion as part of the ResultTask? Are they reliable? If so, when? Can I rely on accumulators

[Yarn-Client]Can not access SparkUI

2015-10-26 Thread Earthson
We are using Spark 1.5.1 with `--master yarn`; the YARN RM is running in HA mode. (Screenshots of the direct visit, the ApplicationMaster link, and the YARN RM log were attached but stripped by the list.)

Re: [Yarn-Client]Can not access SparkUI

2015-10-26 Thread syepes
Hello Earthson, Is your cluster multihomed? If yes, try setting the variables SPARK_LOCAL_{IP,HOSTNAME}. I had this issue before: https://issues.apache.org/jira/browse/SPARK-11147

Re: Large number of conf broadcasts

2015-10-26 Thread Anders Arpteg
Nice Koert, lets hope it gets merged soon. /Anders On Fri, Oct 23, 2015 at 6:32 PM Koert Kuipers wrote: > https://github.com/databricks/spark-avro/pull/95 > > On Fri, Oct 23, 2015 at 5:01 AM, Koert Kuipers wrote: > >> oh no wonder... it undoes the glob (i

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
I cat /proc/net/dev and then take the difference of received bytes before and after the job. I also see a long-time peak (nearly 600Mb/s) in nload interface. We have 18 machines and each machine receives 4.7G bytes. On Mon, Oct 26, 2015 at 5:00 PM Sean Owen wrote: > -dev

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
Have a look at your HDFS replication, and where the blocks are for these files. For example, if you had only 2 HDFS data nodes, then data would be remote to 16 of 18 workers and always entail a copy. On Mon, Oct 26, 2015 at 9:12 AM, Jinfeng Li wrote: > I cat /proc/net/dev and

Re: get host from rdd map

2015-10-26 Thread Deenar Toraskar
You can call any API that returns the hostname in your map function. Here's a simplified example; you would generally use mapPartitions as it saves the overhead of retrieving the hostname multiple times:

    import scala.sys.process._
    val distinctHosts =
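The snippet above is cut off in the archive; a minimal reconstruction of the idea (rdd is any existing RDD, and the hostname lookup shown is illustrative) could look like:

    import java.net.InetAddress

    val distinctHosts = rdd.mapPartitions { iter =>
      // Resolve the executor's hostname once per partition, not once per record.
      val host = InetAddress.getLocalHost.getHostName   // or: import scala.sys.process._; "hostname".!!.trim
      iter.map(record => (host, record))
    }.keys.distinct().collect()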

Re: Spark scala REPL - Unable to create sqlContext

2015-10-26 Thread Deenar Toraskar
Embedded Derby, which Hive/Spark SQL uses as the default metastore, only supports a single user at a time. Until this issue is fixed, you could use another metastore that supports multiple concurrent users (e.g. networked Derby or MySQL) to get around it. On 25 October 2015 at 16:15, Ge, Yao (Y.)

Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-26 Thread Zhiliang Zhu
Hi Meihua, I just found that setFitIntercept(false) was introduced in Spark 1.5.0; my current version is 1.4.0. I shall also try that after updating the version. Since you said Breeze is probably used, I know Breeze is used under the hood of Spark ML. Would you help comment some more on how to

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
The input data is a number of 16M files. On Mon, Oct 26, 2015 at 5:12 PM Jinfeng Li wrote: > I cat /proc/net/dev and then take the difference of received bytes before > and after the job. I also see a long-time peak (nearly 600Mb/s) in nload > interface. We have 18 machines

Kryo makes String data invalid

2015-10-26 Thread Saif.A.Ellafi
Hi all, I have a parquet file which I am loading in a shell. When I launch the shell with --driver-java-options="-Dspark.serializer=...kryo", it makes a couple of fields look like: 03-?? ??-?? ??-??? when calling > data.first I will confirm briefly, but I am utterly sure it happens
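For reference, a minimal sketch of setting the serializer through SparkConf instead of a driver Java option (the app name is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)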

RE: Dynamic Resource Allocation with Spark Streaming (Standalone Cluster, Spark 1.5.1)

2015-10-26 Thread Silvio Fiorito
Hi Matthias, Unless there was a change in 1.5, I'm afraid dynamic resource allocation is not yet supported in streaming apps. Thanks, Silvio Sent from my Lumia 930 From: Matthias Niehoff Sent: ‎10/‎26/‎2015 4:00 PM To:

Joining large data sets

2015-10-26 Thread Bryan
Hello. What is the suggested practice for joining two large data streams? I am currently simply mapping out the key tuple on both streams then executing a join. I have seen several suggestions for broadcast joins that seem to be targeted at joining a larger data set to a small set

Re: Broadcast table

2015-10-26 Thread Jags Ramnarayanan
If you are using Spark SQL and joining two dataFrames the optimizer would automatically broadcast the smaller table (You can configure the size if the default is too small). Else, in code, you can collect any RDD to the driver and broadcast using the context.broadcast method.
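A small sketch of both options (the DataFrame/RDD names and the threshold value are illustrative; assumes an existing SQLContext and SparkContext):

    import org.apache.spark.sql.functions.broadcast

    // DataFrame API (Spark 1.5+): hint that the smaller side should be broadcast.
    val joined = largeDF.join(broadcast(smallDF), "id")

    // The size threshold (in bytes) used for automatic broadcast joins can be tuned:
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)

    // RDD alternative: collect the small side to the driver and broadcast it explicitly.
    val smallMap = sc.broadcast(smallRdd.collectAsMap())
    val joinedRdd = largeRdd.flatMap { case (k, v) => smallMap.value.get(k).map(s => (k, (v, s))) }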

Dynamic Resource Allocation with Spark Streaming (Standalone Cluster, Spark 1.5.1)

2015-10-26 Thread Matthias Niehoff
Hello everybody, I have a few (~15) Spark Streaming jobs which have load peaks as well as long times with a low load. So I thought the new Dynamic Resource Allocation for Standalone Clusters might be helpful (SPARK-4751). I have a test "cluster" with 1 worker consisting of 4 executors with 2

Custom function to operate on Dataframe Window

2015-10-26 Thread aaryabhatta
Hi, Is there a way to create a custom function in PySpark to operate on a DataFrame window? For example, something similar to the rank() function that outputs the rank within that window. If it can only be done in Scala / Java, may I know how? Thanks. Regards

Submitting Spark Applications - Do I need to leave ports open?

2015-10-26 Thread markluk
I want to submit interactive applications to a remote Spark cluster running in standalone mode. I understand I need to connect to the master's port 7077. It also seems like the master node needs to open connections to my local machine, and the ports that it needs to open are different every time.
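One common way to deal with the ephemeral callback ports is to pin the driver-side ports so firewall rules can stay static; a sketch (the property names are standard Spark settings, the port values are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.driver.port", "7001")        // port the cluster uses to reach the driver
      .set("spark.blockManager.port", "7003")  // block manager port on driver and executors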

Re: Maven build failed (Spark master)

2015-10-26 Thread Kayode Odeyemi
I used this command which is synonymous to what you have: ./make-distribution.sh --name spark-latest --tgz --mvn mvn -Dhadoop.version=2.6.0 -Phadoop-2.6 -Phive -Phive-thriftserver -DskipTests clean package -U But I still see WARNINGS like this in the output and no .gz file created: cp:

Re: Maven build failed (Spark master)

2015-10-26 Thread Ted Yu
Looks like '-Pyarn' was missing in your command. On Mon, Oct 26, 2015 at 12:06 PM, Kayode Odeyemi wrote: > I used this command which is synonymous to what you have: > > ./make-distribution.sh --name spark-latest --tgz --mvn mvn > -Dhadoop.version=2.6.0 -Phadoop-2.6 -Phive

Re: Spark with business rules

2015-10-26 Thread Jörn Franke
Maybe SparkR? What languages do your Users speak? > On 26 Oct 2015, at 23:12, danilo wrote: > > Hi All, I want to create a monitoring tool using my sensor data. I receive > the events every seconds and I need to create a report using node.js. Right > now I created my kpi

Re: Dynamic Resource Allocation with Spark Streaming (Standalone Cluster, Spark 1.5.1)

2015-10-26 Thread Ted Yu
This is related: SPARK-10955 Warn if dynamic allocation is enabled for Streaming jobs which went into 1.6.0 as well. FYI On Mon, Oct 26, 2015 at 2:26 PM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > Hi Matthias, > > Unless there was a change in 1.5, I'm afraid dynamic resource

Re: Spark with business rules

2015-10-26 Thread Holden Karau
Spark SQL seems like it might be the best interface if your users are already familiar with SQL. On Mon, Oct 26, 2015 at 3:12 PM, danilo wrote: > Hi All, I want to create a monitoring tool using my sensor data. I receive > the events every seconds and I need to create a

Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Interesting. For feature sub-sampling, is it per-node or per-tree? Do you think you can implement generic GBM and have it merged as part of Spark codebase? Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Mon,

Re: Secondary Sorting in Spark

2015-10-26 Thread swetha kasireddy
Right now my code does the following for grouping by sessionId (which is the key) and sorting by timestamp, which is the first value in the tuple. The second value in the tuple is JSON. def getGrpdAndSrtdSessions(rdd: RDD[(String, (Long, String))]): RDD[(String, List[(Long, String)])] = { val
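For comparison, a sketch of the partitioner-based secondary sort (partition by sessionId only, sort by the full composite key within partitions); the class name and partition count are illustrative, and the output shape differs from the List-per-key version above:

    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD

    class SessionIdPartitioner(override val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = key match {
        case (sessionId: String, _) => (sessionId.hashCode & Int.MaxValue) % numPartitions
      }
    }

    // Events end up grouped by session and ordered by timestamp within each partition,
    // without materializing a full List per session in memory.
    def sortedBySession(rdd: RDD[(String, (Long, String))], parts: Int): RDD[((String, Long), String)] =
      rdd.map { case (sid, (ts, json)) => ((sid, ts), json) }
         .repartitionAndSortWithinPartitions(new SessionIdPartitioner(parts))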

Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Also, does it support categorical feature? Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai wrote: > Interesting. For feature sub-sampling, is it

Re: Spark Implementation of XGBoost

2015-10-26 Thread YiZhi Liu
There's an xgboost exploration jira SPARK-8547. Can it be a good start? 2015-10-27 7:07 GMT+08:00 DB Tsai : > Also, does it support categorical feature? > > Sincerely, > > DB Tsai > -- > Web: https://www.dbtsai.com > PGP

Spark with business rules

2015-10-26 Thread danilo
Hi All, I want to create a monitoring tool using my sensor data. I receive the events every second and I need to create a report using node.js. Right now I create my KPIs by coding the formula directly in Spark. However, I would like to make a layer where a non-technical user can write simple

Results change in group by operation

2015-10-26 Thread Saif.A.Ellafi
Hello Everyone, I need urgent help with a data consistency issue I am having. Standalone cluster of five servers. sqlContext is an instance of HiveContext (the default in spark-shell). No special options other than driver memory and executor memory. Parquet partitions are 512 where there are 160

RE: HiveContext ignores ("skip.header.line.count"="1")

2015-10-26 Thread Cheng, Hao
I am not sure if we really want to support that with HiveContext, but a workaround is to use the Spark package at https://github.com/databricks/spark-csv From: Felix Cheung [mailto:felixcheun...@hotmail.com] Sent: Tuesday, October 27, 2015 10:54 AM To: Daniel Haviv; user Subject: RE: HiveContext
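A minimal sketch of that workaround (the path and package version are illustrative; launch with e.g. --packages com.databricks:spark-csv_2.10:1.2.0):

    // Reading the CSV directly with spark-csv treats the first line as a header,
    // which gives the same effect as skip.header.line.count=1.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/path/to/table.csv")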

RE: Concurrent execution of actions within a driver

2015-10-26 Thread Silvio Fiorito
There is a collectAsync action if you want to run them in parallel, but keep in mind the two jobs will need to share resources and you should use the FAIR scheduler. From: praveen S Sent: ‎10/‎26/‎2015 4:27 AM To:

Re: Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi YiZhi, Thank you for mentioning the jira. I will add a note to the jira. Meihua On Mon, Oct 26, 2015 at 6:16 PM, YiZhi Liu wrote: > There's an xgboost exploration jira SPARK-8547. Can it be a good start? > > 2015-10-27 7:07 GMT+08:00 DB Tsai : >>

Re: HiveContext ignores ("skip.header.line.count"="1")

2015-10-26 Thread Daniel Haviv
I will. Thank you. > On 27 Oct 2015, at 4:54, Felix Cheung wrote: > > Please open a JIRA? > > > Date: Mon, 26 Oct 2015 15:32:42 +0200 > Subject: HiveContext ignores ("skip.header.line.count"="1") > From: daniel.ha...@veracity-group.com > To: user@spark.apache.org

Re: Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi DB Tsai, Thank you very much for your interest and comment. 1) feature sub-sampling is per-node, like random forest. 2) The current code heavily exploits the tree structure to speed up the learning (such as processing multiple learning nodes in one pass of the training data). So a generic GBM

RE: Running in cluster mode causes native library linking to fail

2015-10-26 Thread prajod.vettiyattil
Hi Bernardo, Glad that our suggestions helped. A bigger thanks for sharing your solution with us. That was a tricky and difficult problem to track and solve ! Regards, Prajod From: Bernardo Vecchia Stein [mailto:bernardovst...@gmail.com] Sent: 26 October 2015 23:41 To: Prajod S Vettiyattil

RE: HiveContext ignores ("skip.header.line.count"="1")

2015-10-26 Thread Felix Cheung
Please open a JIRA? Date: Mon, 26 Oct 2015 15:32:42 +0200 Subject: HiveContext ignores ("skip.header.line.count"="1") From: daniel.ha...@veracity-group.com To: user@spark.apache.org Hi,I have a csv table in Hive which is configured to skip the header row using

Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi Spark User/Dev, Inspired by the success of XGBoost, I have created a Spark package for gradient boosting tree with 2nd order approximation of arbitrary user-defined loss functions. https://github.com/rotationsymmetry/SparkXGBoost Currently linear (normal) regression, binary classification,

Re: Maven build failed (Spark master)

2015-10-26 Thread Kayode Odeyemi
I see a lot of output like this after an otherwise successful Maven build: cp: /usr/local/spark-latest/spark-[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-spark-latest/python/test_support/sql/parquet_partitioned/year=2014/month=9/day=1/part-r-8.gz.parquet: No such file

Re: Maven build failed (Spark master)

2015-10-26 Thread Ted Yu
If you use the command shown in: https://github.com/apache/spark/pull/9281 You should have got the following: ./dist/python/test_support/sql/parquet_partitioned/year=2014/month=9/day=1/part-r-8.gz.parquet

Re: Error Compiling Spark 1.4.1 w/ Scala 2.11 & Hive Support

2015-10-26 Thread Bryan Jeffrey
All, The error came down to a bad version of jline being pulled from Maven. The jline version is defined as 'scala.version' -- that 2.11 version does not exist in Maven. Instead the following dependency should be used: groupId org.scala-lang, artifactId jline, version 2.11.0-M3. Regards, Bryan Jeffrey

RE: Loading binary files from NFS share

2015-10-26 Thread Andrianasolo Fanilo
Hi again, I found this: https://github.com/NetApp/NetApp-Hadoop-NFS-Connector Maybe it will enable you to read NFS data from Spark at least. Has anyone from the community used it? BR, Fanilo From: Andrianasolo Fanilo Sent: Monday, 26 October 2015 15:24 To: 'Kayode Odeyemi'; user Subject: RE:

Loading binary files from NFS share

2015-10-26 Thread Kayode Odeyemi
Hi, Is it possible to load binary files from an NFS share like this: sc.binaryFiles("nfs://host/mountpath")? I understand that it takes a path, but I want to know if it allows a protocol prefix. Appreciate your help.

RE: Loading binary files from NFS share

2015-10-26 Thread Andrianasolo Fanilo
Hi, I believe binaryFiles uses a custom Hadoop Input Format, so it can only read specific Hadoop protocols. You can find the full list of supported protocols by typing “Hadoop filesystems hdfs hftp” in Google (the link I found is a little bit long and references the Hadoop Definitive Guide,
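A small sketch of the simplest route, assuming the NFS share is already mounted at the same path on every worker (the path is illustrative):

    // The local filesystem scheme works for a mounted NFS share without any special connector.
    val files = sc.binaryFiles("file:///mnt/nfs-share/data/")
    val sizes = files.map { case (path, stream) => (path, stream.toArray().length) }.collect()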

Re: Error Compiling Spark 1.4.1 w/ Scala 2.11 & Hive Support

2015-10-26 Thread Sean Owen
Did you switch the build to Scala 2.11 by running the script in dev/? It won't work otherwise, but does work if you do. @Ted 2.11 was supported in 1.4, not just 1.5. On Mon, Oct 26, 2015 at 2:13 PM, Bryan Jeffrey wrote: > All, > > The error resolved to a bad version of

Re: Anyone feels sparkSQL in spark1.5.1 very slow?

2015-10-26 Thread Yin Huai
@filthysocks, can you get the output of jmap -histo before the OOM ( http://docs.oracle.com/javase/7/docs/technotes/tools/share/jmap.html)? On Mon, Oct 26, 2015 at 6:35 AM, filthysocks wrote: > We upgrade from 1.4.1 to 1.5 and it's a pain > see > >

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
Hi, yes, it should be the same issue, but the solution doesn't apply in our situation. Anyway, thanks a lot for your replies. On Mon, Oct 26, 2015 at 7:44 PM Sean Owen wrote: > Hm, now I wonder if it's the same issue here: > https://issues.apache.org/jira/browse/SPARK-10149

Error Compiling Spark 1.4.1 w/ Scala 2.11 & Hive Support

2015-10-26 Thread Bryan Jeffrey
All, I'm seeing the following error compiling Spark 1.4.1 w/ Scala 2.11 & Hive support. Any ideas? mvn -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive -Phive-thriftserver package [INFO] Spark Project Parent POM .. SUCCESS [4.124s] [INFO] Spark Launcher

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Jerry Lam
Hi Fengdong, Why does it need more memory at the driver side when there are many partitions? It seems the implementation can only support use cases with a dozen partitions; when it is over 100, it falls apart. It is also quite slow to initialize the loading of partition tables when the number of

Spark 1.5.1 hadoop 2.4 does not clear hive staging files after job finishes

2015-10-26 Thread unk1102
Hi, I have a Spark job which creates Hive table partitions. I have switched to Spark 1.5.1, and Spark 1.5.1 creates so many Hive staging files and doesn't delete them after the job finishes. Is it a bug, or do I need to disable something to prevent Hive staging files from getting created, or at least

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Steve Loughran
On 26 Oct 2015, at 11:21, Sean Owen > wrote: Yeah, are these stats actually reflecting data read locally, like through the loopback interface? I'm also no expert on the internals here but this may be measuring effectively local reads. Or are you

Re: Error Compiling Spark 1.4.1 w/ Scala 2.11 & Hive Support

2015-10-26 Thread Ted Yu
Scala 2.11 is supported in 1.5.1 release: http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-parent_2.11%22 Can you upgrade ? Cheers On Mon, Oct 26, 2015 at 6:01 AM, Bryan Jeffrey wrote: > All, > > I'm seeing the following error compiling Spark 1.4.1 w/ Scala

Re: Maven build failed (Spark master)

2015-10-26 Thread Kayode Odeyemi
Hi, The ./make-distribution.sh run completed. However, I can't seem to locate the .tar.gz file. Where does Spark save this? Or should I just work with the dist directory? On Fri, Oct 23, 2015 at 4:23 PM, Kayode Odeyemi wrote: > I saw this when I tested manually (without

HiveContext ignores ("skip.header.line.count"="1")

2015-10-26 Thread Daniel Haviv
Hi, I have a csv table in Hive which is configured to skip the header row using TBLPROPERTIES("skip.header.line.count"="1"). When querying from Hive the header row is not included in the data, but when running the same query via HiveContext I get the header row. I made sure that HiveContext sees

Re: Anyone feels sparkSQL in spark1.5.1 very slow?

2015-10-26 Thread filthysocks
We upgraded from 1.4.1 to 1.5 and it's a pain; see http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-5-1-driver-memory-problems-while-doing-Cross-Validation-do-not-occur-with-1-4-1-td25076.html

Re: Maven build failed (Spark master)

2015-10-26 Thread Yana Kadiyska
In 1.4, ./make-distribution.sh produces a .tgz file in the root directory (the same directory that make-distribution.sh is in). On Mon, Oct 26, 2015 at 8:46 AM, Kayode Odeyemi wrote: > Hi, > > The ./make_distribution task completed. However, I can't seem to locate the > .tar.gz file. >

Re: Spark scala REPL - Unable to create sqlContext

2015-10-26 Thread Richard Hillegas
Note that embedded Derby supports multiple, simultaneous connections, that is, multiple simultaneous users. But a Derby database is owned by the process which boots it. Only one process can boot a Derby database at a given time. The creation of multiple SQL contexts must be spawning multiple

correct and fast way to stop streaming application

2015-10-26 Thread Krot Viacheslav
Hi all, I wonder what is the correct way to stop a streaming application if some job failed? What I have now:

    val ssc = new StreamingContext
    ssc.start()
    try {
      ssc.awaitTermination()
    } catch {
      case e => ssc.stop(stopSparkContext = true, stopGracefully = false)
    }

It works but one problem

RE: Problem with make-distribution.sh

2015-10-26 Thread java8964
Maybe you need the Hive part? Yong Date: Mon, 26 Oct 2015 11:34:30 -0400 Subject: Problem with make-distribution.sh From: yana.kadiy...@gmail.com To: user@spark.apache.org Hi folks, building spark instructions (http://spark.apache.org/docs/latest/building-spark.html) suggest that

Re: Problem with make-distribution.sh

2015-10-26 Thread Yana Kadiyska
thank you so much! You are correct. This is the second time I've made this mistake :( On Mon, Oct 26, 2015 at 11:36 AM, java8964 wrote: > Maybe you need the Hive part? > > Yong > > -- > Date: Mon, 26 Oct 2015 11:34:30 -0400 > Subject: Problem

Problem with make-distribution.sh

2015-10-26 Thread Yana Kadiyska
Hi folks, building spark instructions ( http://spark.apache.org/docs/latest/building-spark.html) suggest that ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn should produce a distribution similar to the ones found on the "Downloads" page. I noticed that the tgz I built

Spark Streaming: how to use StreamingContext.queueStream with existing RDD

2015-10-26 Thread Anfernee Xu
Hi, Here's my situation, I have some kind of offline dataset and got them loaded them into Spark as RDD, but I want to form a virtual data stream feeding to Spark Streaming, my code looks like this // sort offline data by time, the dataset spans 2 hours 1) JavaRDD sortedByTime =
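A rough sketch of one way to do this (names, types and the batch interval are illustrative; assumes an existing SparkContext sc and sortedByTime: RDD[(Long, String)] keyed by event time in milliseconds):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))

    // Slice the offline data into one RDD per 10-second window and queue them up.
    val batchMs = 10000L
    val start = sortedByTime.keys.min()
    val numBatches = ((sortedByTime.keys.max() - start) / batchMs + 1).toInt

    val queue = new mutable.Queue[RDD[(Long, String)]]()
    (0 until numBatches).foreach { i =>
      val lo = start + i * batchMs
      queue += sortedByTime.filter { case (ts, _) => ts >= lo && ts < lo + batchMs }
    }

    // Each streaming batch dequeues exactly one of the pre-sliced RDDs.
    val virtualStream = ssc.queueStream(queue, oneAtATime = true)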

Re: correct and fast way to stop streaming application

2015-10-26 Thread varun sharma
+1, wanted to do same. On Mon, Oct 26, 2015 at 8:58 PM, Krot Viacheslav wrote: > Hi all, > > I wonder what is the correct way to stop streaming application if some job > failed? > What I have now: > > val ssc = new StreamingContext > > ssc.start() > try { >

Re: Problem with make-distribution.sh

2015-10-26 Thread Sean Owen
I don't think the page suggests that gives you any of the tarballs on the downloads page, and -Phive does not by itself do so either. On Mon, Oct 26, 2015 at 4:58 PM, Ted Yu wrote: > I logged SPARK-11318 with a PR. > > I verified that by adding -Phive the datanucleus jars

Re: Spark Streaming: how to use StreamingContext.queueStream with existing RDD

2015-10-26 Thread Dean Wampler
Check out StreamingContext.queueStream ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext ) dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe

Re: [Spark Streaming] How do we reset the updateStateByKey values.

2015-10-26 Thread Uthayan Suthakar
Thank you Adrian for your reply. I've already managed to resolve this issue and coincidently it is similar to the solution that you've proposed. Cheers, Uthay. On 26 October 2015 at 10:41, Adrian Tanase wrote: > Have you considered union-ing the 2 streams? Basically you can

rdd conversion

2015-10-26 Thread Yasemin Kaya
Hi, I have a *JavaRDD>>* and I want to convert every map to a pair RDD, I mean a *JavaPairRDD*. There is a loop over the list to get the indexed map; when I write the code below, it returns me only one RDD. JavaPairRDD mapToRDD =

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Koert Kuipers
It seems HadoopFsRelation keeps track of all part files (instead of just the data directories). I believe this has something to do with Parquet footers, but I didn't bother to look more into it. The result is that, on the driver side, it: 1) tries to keep track of all part files in a Map[Path,

Re: Problem with make-distribution.sh

2015-10-26 Thread Ted Yu
I logged SPARK-11318 with a PR. I verified that by adding -Phive the datanucleus jars are included: tar tzvf spark-1.6.0-SNAPSHOT-bin-custom-spark.tgz | grep datanucleus -rw-r--r-- hbase/hadoop 1890075 2015-10-26 09:52 spark-1.6.0-SNAPSHOT-bin-custom-spark/lib/datanucleus-core-3.2.10.jar

Re: rdd conversion

2015-10-26 Thread Ted Yu
bq. t = new Tuple2 (entry.getKey(), entry.getValue()); The return statement is outside the loop. That was why you got one RDD. On Mon, Oct 26, 2015 at 9:40 AM, Yasemin Kaya wrote: > Hi, > > I have *JavaRDD>>* and I want to >

Re: rdd conversion

2015-10-26 Thread Yasemin Kaya
But if I put the return inside the loop, the method still wants a return statement. 2015-10-26 19:09 GMT+02:00 Ted Yu : > bq. t = new Tuple2 (entry.getKey(), > entry.getValue()); > > The return statement is outside the loop. > That was why you got one RDD. >
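The usual fix is to emit one pair per map entry rather than returning a single tuple from the mapping function; a sketch of the idea in Scala (in Java the equivalent is flatMapToPair), with an illustrative element type:

    // One output pair per entry of every map, instead of one per input element.
    val mapsRdd = sc.parallelize(Seq(
      List(Map("a" -> 1, "b" -> 2), Map("c" -> 3)),
      List(Map("d" -> 4))
    ))
    val pairs = mapsRdd.flatMap(_.flatMap(_.toSeq))   // RDD[(String, Int)]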

Re: Concurrent execution of actions within a driver

2015-10-26 Thread Rishitesh Mishra
Spark executes tasks on an action. An action is broken down into multiple tasks. Tasks from different actions run either in FIFO or FAIR mode depending on spark.scheduler.mode. Of course, to get the benefit of FAIR scheduling, the two actions should be called from different threads. On Mon, Oct
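A small sketch of calling two actions from different threads (rddA and rddB are assumed to exist; spark.scheduler.mode=FAIR must be set on the SparkConf before the context is created):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Each Future submits its action from its own thread, so the FAIR scheduler can interleave the jobs.
    val f1 = Future { rddA.count() }
    val f2 = Future { rddB.collect() }
    val countA  = Await.result(f1, Duration.Inf)
    val resultB = Await.result(f2, Duration.Inf)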

Re: Problem with make-distribution.sh

2015-10-26 Thread Yana Kadiyska
@Sean, here is where I think it's a little misleading (underlining is mine): Building a Runnable Distribution To create a *Spark distribution like those distributed by the Spark Downloads page*, and that is laid out so as to be runnable, use

Broadcast table

2015-10-26 Thread Younes Naguib
Hi all, I use the thrift server, and I cache a table using "cache table mytab". Is there any sql to broadcast it too? Thanks Younes Naguib Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC H3G 1R8 Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 |

Re: Running in cluster mode causes native library linking to fail

2015-10-26 Thread Bernardo Vecchia Stein
Hello guys, After lots of time trying to make things work, I finally found what was causing the issue: I was calling the function from the library inside a map function, which caused the code inside it to be run in executors instead of the driver. Since only the driver had loaded the library,