pyspark pickle error when using itertools.groupby

2016-08-04 Thread 林家銘
Hi, I wrote a map function to aggregate data in a partition, and this function uses itertools.groupby more than twice; then comes the pickle error. Here is what I do ===Driver Code=== pair_count = df.mapPartitions(lambda iterable: pair_func_cnt(iterable)) pair_count.collect()
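For reference, a minimal PySpark sketch of one way this kind of error is often avoided: materialize each itertools.groupby group into a plain value inside mapPartitions so that no generator/grouper objects ever need to be pickled. The function name pair_func_cnt and the (key, value) layout are assumptions standing in for the poster's actual code.

    from itertools import groupby
    from operator import itemgetter
    from pyspark import SparkContext

    sc = SparkContext(appName="groupby-in-partition")

    def pair_func_cnt(rows):
        # groupby only groups consecutive keys, so sort the partition first.
        rows = sorted(rows, key=itemgetter(0))
        for key, group in groupby(rows, key=itemgetter(0)):
            # Materialize the group so the function yields plain tuples,
            # never an itertools grouper object (which cannot be pickled).
            yield (key, len(list(group)))

    rdd = sc.parallelize([("a", 1), ("a", 2), ("b", 3)], 2)
    # Counts are per-partition here; a reduceByKey would combine them globally.
    pair_count = rdd.mapPartitions(pair_func_cnt)
    print(pair_count.collect())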

Re: Writing all values for same key to one file

2016-08-04 Thread rtijoriwala
Hi Colzer, Thanks for the response. My main question was about writing one file per "key", i.e. having a file with all values for a given key. So in the pseudo code that I have above, am I opening/creating the file in the right place? Once the file is created and closed, I cannot append to it.

Re: [Spark 2.0] Problem with Spark Thrift Server show NULL instead of showing BIGINT value

2016-08-04 Thread Chanh Le
I checked with Spark 1.6.1 and it still works fine. I also checked out the latest source code in the Spark 2.0 branch, built it, and got the same issue. I think it is because of the API change to Dataset in Spark 2.0? Regards, Chanh > On Aug 5, 2016, at 9:44 AM, Chanh Le wrote: > > Hi

Java and SparkSession

2016-08-04 Thread Andy Grove
From some brief experiments using Java with Spark 2.0, it looks like Java developers should stick to SparkContext and SQLContext rather than using the new SparkSession API. It would be great if someone could confirm whether that is the intention or not. Thanks, Andy. -- Andy Grove Chief Architect

Re: Regression in Java RDD sortBy() in Spark 2.0

2016-08-04 Thread Andy Grove
Moments after sending this I tracked down the issue to a subsequent transformation, .top(10), which ran without error in Spark 1.6 (but who knows how it was sorting, since the POJO doesn't implement Comparable), whereas in Spark 2.0 it now fails if the POJO is not Comparable. The new behavior is

Regression in Java RDD sortBy() in Spark 2.0

2016-08-04 Thread Andy Grove
Hi, I have some working Java code with Spark 1.6 that I am upgrading to Spark 2.0. I have this valid RDD: JavaRDD popSummary I want to sort using a function I provide for performing comparisons: popSummary .sortBy((Function) p -> p.getMale() *

Re: [Spark1.6]:compare rows and add new column based on lookup

2016-08-04 Thread ayan guha
select * from ( select col1 as old_st, col2 as person, lead(col1) over (partition by col2 order by timestamp) next_st from main_table ) m where next_st is not null This will give you the old street to new street mapping in one row. You can then join to the lookup table. On Fri, Aug 5, 2016 at 12:48 PM, Divya
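A minimal PySpark equivalent of that lead() approach, assuming the DataFrame is called df and the columns are named street, person and timestamp (stand-ins for col1/col2):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.partitionBy("person").orderBy("timestamp")
    moves = (df
             .withColumn("next_st", F.lead("street").over(w))
             .where(F.col("next_st").isNotNull()))
    # Each row now holds (old street, new street) for one person; join this
    # to the lookup table on those two columns to fill the new column.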

Re: [Spark1.6]:compare rows and add new column based on lookup

2016-08-04 Thread Divya Gehlot
based on the time stamp column On 5 August 2016 at 10:43, ayan guha wrote: > How do you know person1 is moving from street1 to street2 and not other > way around? Basically, how do you ensure the order of the rows as you have > written them? > > On Fri, Aug 5, 2016 at

Re: [Spark 2.0] Problem with Spark Thrift Server show NULL instead of showing BIGINT value

2016-08-04 Thread Chanh Le
Hi Nicholas, Thanks for the information. How did you solve the issue? Did you change the parquet file by renaming the column name? I tried changing the column name when creating the table in Hive without changing the parquet file, but it still shows NULL. My parquet files are quite big

Re: [Spark1.6]:compare rows and add new column based on lookup

2016-08-04 Thread ayan guha
How do you know person1 is moving from street1 to street2 and not other way around? Basically, how do you ensure the order of the rows as you have written them? On Fri, Aug 5, 2016 at 12:16 PM, Divya Gehlot wrote: > Hi, > I am working with Spark 1.6 with scala and

[Spark1.6]:compare rows and add new column based on lookup

2016-08-04 Thread Divya Gehlot
Hi, I am working with Spark 1.6 with Scala and using the DataFrame API. I have a use case where I need to compare two rows and add an entry in the new column based on the lookup table. For example, my DF looks like: col1 col2 newCol1 street1 person1 street2 person1

Re: Writing all values for same key to one file

2016-08-04 Thread colzer
For RDDs, you can use `saveAsHadoopFile` with a custom `MultipleOutputFormat`.
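On the DataFrame side, a commonly used alternative (a different technique from the MultipleOutputFormat route above) is the writer's partitionBy, which puts all values for a given key under that key's own output directory. A minimal PySpark sketch, where the column names key and value and the output path are assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("one-dir-per-key").getOrCreate()
    df = spark.createDataFrame([("k1", "v1"), ("k1", "v2"), ("k2", "v3")],
                               ["key", "value"])

    # Writes /tmp/out/key=k1/..., /tmp/out/key=k2/..., i.e. one directory per
    # key value; repartitioning by the key first tends to reduce the number
    # of files produced per key.
    df.write.partitionBy("key").mode("overwrite").json("/tmp/out")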

Spark 1.6 Streaming delay after long run

2016-08-04 Thread Chan Chor Pang
After upgrading from Spark 1.5 to 1.6 (CDH 5.6.0 -> 5.7.1), some of our streaming jobs get delayed after a long run. With a little investigation, here is what I found: - the same program has no problem with Spark 1.5 - we have two kinds of streaming jobs and only those with "updateStateByKey" was

singular value decomposition in Spark ML

2016-08-04 Thread Sandy Ryza
Hi, Is SVD or PCA in Spark ML (i.e. spark.ml parity with the mllib RowMatrix.computeSVD API) slated for any upcoming release? Many thanks for any guidance! -Sandy

Re: Writing all values for same key to one file

2016-08-04 Thread ayan guha
Partition your data by the key: rdd.partitionBy(...) On Fri, Aug 5, 2016 at 10:10 AM, rtijoriwala wrote: > Any recommendations? comments?

Re: Spark SQL Hive Authorization

2016-08-04 Thread arin.g
Any updates on this? I am also trying to install Ranger with Spark SQL and I have the same issue with Spark 1.6 and Ranger 0.5.4. I have used the enable-plugin.sh script to activate the hive-ranger plugin and verified that all the required configuration files are in spark/conf. Thanks, -Arin

Re: Writing all values for same key to one file

2016-08-04 Thread rtijoriwala
Any recommendations? comments?

Re: How to set nullable field when create DataFrame using case class

2016-08-04 Thread Luis Mateos
Hi Jacek, I have not used Encoders before. Definitely this works! Thank you! Luis On 4 August 2016 at 18:23, Jacek Laskowski wrote: > On Thu, Aug 4, 2016 at 11:56 PM, luismattor wrote: > > > import java.sql.Timestamp > > case class MyProduct(t:

Re: How to set nullable field when create DataFrame using case class

2016-08-04 Thread Michael Armbrust
Nullable is an optimization for Spark SQL. It tells Spark to not even do an if-check when accessing that field. In this case, your data *is* nullable, because Timestamp is an object in Java and you could put null there. On Thu, Aug 4, 2016 at 2:56 PM, luismattor

Re: Explanation regarding Spark Streaming

2016-08-04 Thread Jacek Laskowski
On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller wrote: > and eventually you will run out of memory. Why? Mind elaborating? Jacek

Re: How to set nullable field when create DataFrame using case class

2016-08-04 Thread Jacek Laskowski
On Thu, Aug 4, 2016 at 11:56 PM, luismattor wrote: > import java.sql.Timestamp > case class MyProduct(t: Timestamp, a: Float) > val rdd = sc.parallelize(List(MyProduct(new Timestamp(0), 10))).toDF() > rdd.printSchema() > > The output is: > root > |-- t: timestamp (nullable

Re: How to set nullable field when create DataFrame using case class

2016-08-04 Thread Jacek Laskowski
On Thu, Aug 4, 2016 at 11:56 PM, luismattor wrote: > How can I set the timestamp column to be NOT nullable? Hi, Given [1] it's not possible without defining your own Encoder for Dataset (that you use implicitly). It'd be something as follows: implicit def myEncoder:

Re: Explanation regarding Spark Streaming

2016-08-04 Thread Mich Talebzadeh
Also check the Spark UI streaming section for various helpful stats. By default it runs on port 4040, but you can change it by setting --conf "spark.ui.port=" HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

RE: Explanation regarding Spark Streaming

2016-08-04 Thread Mohammed Guller
The backlog will increase as time passes and eventually you will run out of memory. Mohammed Author: Big Data Analytics with Spark From: Saurav Sinha [mailto:sauravsinh...@gmail.com] Sent: Wednesday, August 3, 2016

Re: Questions about ml.random forest (only one decision tree?)

2016-08-04 Thread Robin East
All supervised learning algorithms in Spark work the same way. You provide a set of ‘features’ (X) and a corresponding label (y) as part of a pipeline and call the fit method on the pipeline. The output of this is a model. You can then provide new examples (new Xs) to a transform method on the
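A minimal PySpark illustration of that fit/transform flow with a random forest; the DataFrames train and test with "features" and "label" columns are assumptions for the sketch:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier

    rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                                numTrees=20)   # an ensemble, not a single tree
    pipeline = Pipeline(stages=[rf])

    model = pipeline.fit(train)           # features (X) + label (y) in -> fitted model out
    predictions = model.transform(test)   # new Xs in -> predictions appended as columns
    predictions.select("prediction", "probability").show(5)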

How to set nullable field when create DataFrame using case class

2016-08-04 Thread luismattor
Hi all, Consider the following case: import java.sql.Timestamp case class MyProduct(t: Timestamp, a: Float) val rdd = sc.parallelize(List(MyProduct(new Timestamp(0), 10))).toDF() rdd.printSchema() The output is: root |-- t: timestamp (nullable = true) |-- a: float (nullable = false) How can

Re: Add column sum as new column in PySpark dataframe

2016-08-04 Thread Mike Metzger
This is a little ugly, but it may do what you're after - df.withColumn('total', expr("+".join([col for col in df.columns]))) I believe this will handle null values ok, but will likely error if there are any string columns present. Mike On Thu, Aug 4, 2016 at 8:41 AM, Javier Rey
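A slightly more defensive variant of the same idea, sketched under the assumption that df contains only numeric columns: build the sum as Column expressions instead of a SQL string, and coalesce nulls to 0.

    from functools import reduce
    from operator import add
    from pyspark.sql import functions as F

    # Null-safe sum of all columns, built without string concatenation.
    total = reduce(add, [F.coalesce(F.col(c), F.lit(0)) for c in df.columns])
    df2 = df.withColumn("total", total)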

Symbol HasInputCol is inaccesible from this place

2016-08-04 Thread janardhan shetty
Version: 2.0.0-preview import org.apache.spark.ml.param._ import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} class CustomTransformer(override val uid: String) extends Transformer with HasInputCol with HasOutputCol with DefaultParamsWritable import

Re: registering udf to use in spark.sql('select...

2016-08-04 Thread Mich Talebzadeh
Yes, pretty straightforward: define, register and use. def cleanupCurrency (word : String) : Double = { word.toString.substring(1).replace(",", "").toDouble } sqlContext.udf.register("cleanupCurrency", cleanupCurrency(_:String)) val a = df.filter(col("Total") > "").map(p =>

Re: Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-04 Thread Richard Siebeling
Fixed! After adding the option -DskipTests everything built OK. Thanks Sean for your help. On Thu, Aug 4, 2016 at 8:18 PM, Richard Siebeling wrote: > I don't see any other errors, these are the last lines of the > make-distribution log. > Above these lines there are no

Re: Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-04 Thread Richard Siebeling
I don't see any other errors, these are the last lines of the make-distribution log. Above these lines there are no errors... [INFO] Building jar: /opt/mapr/spark/spark-2.0.0/common/network-yarn/target/spark-network-yarn_2.11-2.0.0-test-sources.jar [warn]

Re: Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-04 Thread Sean Owen
That message is a warning, not an error. It is just because you're cross-compiling with Java 8. If something failed, it was elsewhere. On Thu, Aug 4, 2016, 07:09 Richard Siebeling wrote: > Hi, > > spark 2.0 with mapr hadoop libraries was succesfully build using the > following

Re: registering udf to use in spark.sql('select...

2016-08-04 Thread Nicholas Chammas
No, SQLContext is not disappearing. The top-level class is replaced by SparkSession, but you can always get the underlying context from the session. You can also use SparkSession.udf.register(), which is
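A minimal PySpark 2.0 sketch of that registration path; the function name, column name and temp view name are made up for illustration, and spark is assumed to be an existing SparkSession:

    from pyspark.sql.types import IntegerType

    def square_it(x):
        return x * x

    # Register once on the session, then call the UDF from SQL strings.
    spark.udf.register("square_it", square_it, IntegerType())

    spark.createDataFrame([(2,), (3,)], ["adgroupid"]).createOrReplaceTempView("df")
    spark.sql("select square_it(adgroupid) as function_result from df").show()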

Re: registering udf to use in spark.sql('select...

2016-08-04 Thread Ben Teeuwen
Yes, but I don’t want to use it in a select() call. Either selectExpr() or spark.sql(), with the udf being called inside a string. Now I got it to work using "sqlContext.registerFunction('encodeOneHot_udf',encodeOneHot, VectorUDT())” But this sqlContext approach will disappear, right? So I’m

Re: Spark SQL and number of task

2016-08-04 Thread Marco Colombo
Thanks a lot. That was my suspicion. What is puzzling me is that also in the case of OR the pushdown is present in the explain plan from Hive, while effectively it is not performed by the client. Regards 2016-08-04 15:37 GMT+02:00 Yong Zhang : > The 2 plans look similar, but they

Re: Spark jobs failing due to java.lang.OutOfMemoryError: PermGen space

2016-08-04 Thread Deepak Sharma
Yes, agreed. It seems to be an issue with mapping the text file contents to case classes, not sure though. On Thu, Aug 4, 2016 at 8:17 PM, $iddhe$h Divekar wrote: > Hi Deepak, > > My files are always > 50MB. > I would think there would be a small config to overcome this.

Re: Spark jobs failing due to java.lang.OutOfMemoryError: PermGen space

2016-08-04 Thread $iddhe$h Divekar
Hi Deepak, My files are always > 50MB. I would think there would be a small config to overcome this. I have tried almost everything I could after searching online. Any help from the mailing list would be appreciated. On Thu, Aug 4, 2016 at 7:43 AM, Deepak Sharma wrote: > I am

Re: num-executors, executor-memory and executor-cores parameters

2016-08-04 Thread Mich Talebzadeh
This is a classic minefield of differing explanations. Here we go, this is mine. Local mode: in this mode the driver program (SparkSubmit), the resource manager and the executor all exist within the same JVM. The JVM itself is the worker thread. All local mode jobs run independently. There is no

Re: Spark jobs failing due to java.lang.OutOfMemoryError: PermGen space

2016-08-04 Thread Deepak Sharma
I am facing the same issue with Spark 1.5.2. If the file being processed by Spark is 10-12 MB in size, it throws out of memory, but if the same file is within the 5 MB limit, it runs fine. I am using a Spark configuration with 7GB of memory and 3 cores for executors in the cluster of 8

Re: how to run local[k] threads on a single core

2016-08-04 Thread Daniel Darabos
You could run the application in a Docker container constrained to one CPU with --cpuset-cpus ( https://docs.docker.com/engine/reference/run/#/cpuset-constraint). On Thu, Aug 4, 2016 at 8:51 AM, Sun Rui wrote: > I don’t think it possible as Spark does not support thread to

Raleigh, Durham, and around...

2016-08-04 Thread Jean Georges Perrin
Hi, With some friends, we are trying to develop the Apache Spark community in the Triangle area of North Carolina, USA. If you are from there, feel free to join our Slack team: http://oplo.io/td. Danny Siegle has also organized a lot of meetups around the edX courses (see

Spark jobs failing due to java.lang.OutOfMemoryError: PermGen space

2016-08-04 Thread $iddhe$h Divekar
Hi, I am running Spark jobs using Apache Oozie in yarn-client mode. My job.properties has sparkConf, which gets used in workflow.xml. I have tried increasing MaxPermSize using sparkConf in job.properties but that is not resolving the issue. *sparkConf*=--verbose --driver-java-options
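For reference, the usual knobs are the driver/executor extra Java options; a hedged sketch of setting them programmatically, where the 512m value is an arbitrary example and MaxPermSize only applies on Java 7 and earlier:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.driver.extraJavaOptions", "-XX:MaxPermSize=512m")
            .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=512m"))
    # Note: in yarn-client mode the driver JVM is already running before this
    # conf is read, so the driver option has to be passed on the launcher
    # command line or via spark-defaults rather than through SparkConf.
    sc = SparkContext(conf=conf)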

WindowsError: [Error 2] The system cannot find the file specified

2016-08-04 Thread pseudo oduesp
Hi, with pyspark 2.0 I get this error: WindowsError: [Error 2] The system cannot find the file specified. Can someone help me find a solution? Thanks

Re: WindowsError: [Error 2] The system cannot find the file specified

2016-08-04 Thread pseudo oduesp
C:\Users\AppData\Local\Continuum\Anaconda2\python.exe C:/workspacecode/pyspark/pyspark/churn/test.py Traceback (most recent call last): File "C:/workspacecode/pyspark/pyspark/churn/test.py", line 5, in conf = SparkConf() File

Re: Add column sum as new column in PySpark dataframe

2016-08-04 Thread Mich Talebzadeh
Sorry, do you want the sum for each row or the sum for each column? Assuming all rows are numeric. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: registering udf to use in spark.sql('select...

2016-08-04 Thread Nicholas Chammas
Have you looked at pyspark.sql.functions.udf and the associated examples? On Thu, Aug 4, 2016 at 9:10 AM, Ben Teeuwen wrote: > Hi, > > I’d like to use a UDF in pyspark 2.0. As in .. > > > def squareIt(x): > return x * x > > # register the function and define return type >

Re: source code for org.spark-project.hive

2016-08-04 Thread Ted Yu
https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 FYI On Thu, Aug 4, 2016 at 6:23 AM, prabhat__ wrote: > hey > can anyone point me to the source code for the jars used with group-id > org.spark-project.hive. > This was previously maintained in the private

Re: source code for org.spark-project.hive

2016-08-04 Thread Prabhat Kumar Gupta
Thanks a lot. On Thu, Aug 4, 2016 at 7:16 PM, Ted Yu wrote: > https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 > > FYI > > On Thu, Aug 4, 2016 at 6:23 AM, prabhat__ > wrote: > >> hey >> can anyone point me to the source code for the

Add column sum as new column in PySpark dataframe

2016-08-04 Thread Javier Rey
Hi everybody, Sorry, I sent the last message incomplete; this is the complete one: I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns. Suppose my dataframe had columns "a", "b", and "c". I know I can do this:

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Nick Pentreath
Sure, I understand there are some issues with handling this missing value situation in StringIndexer currently. Your workaround is not ideal but I see that it is probably the only mechanism available currently to avoid the problem. But the OOM issues seem to be more about the feature cardinality

num-executors, executor-memory and executor-cores parameters

2016-08-04 Thread Ashok Kumar
Hi, I would like to know the exact definitions of these three parameters: num-executors, executor-memory and executor-cores, for local, standalone and YARN modes. I have looked at the online docs but am not convinced I understand them correctly. Thanking you

Add column sum as new column in PySpark dataframe

2016-08-04 Thread Javier Rey
I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns. Suppose my dataframe had columns "a", "b", and "c". I know I can do this:

Re: Spark SQL and number of task

2016-08-04 Thread Yong Zhang
The 2 plans look similar, but there is a big difference if you also consider that your source is in fact a NoSQL DB, like C*. The OR plan has "Filter ((id#0L = 94) || (id#0L = 2))", which means the filter is indeed happening on the Spark side instead of on the C* side, which means that to fulfill

source code for org.spark-project.hive

2016-08-04 Thread prabhat__
Hey, can anyone point me to the source code for the jars used with group-id org.spark-project.hive. This was previously maintained in the private repo of pwendell (https://github.com/pwendell/hive) which doesn't seem to be active now. Where can I find the source code for group:

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Ben Teeuwen
Hi Nick, Thanks for the suggestion. Reducing the dimensionality is an option, thanks, but let’s say I really want to do this :). The reason why it’s so big is that I’m unifying my training and test data, and I don’t want to drop rows in the test data just because one of the features was

registering udf to use in spark.sql('select...

2016-08-04 Thread Ben Teeuwen
Hi, I’d like to use a UDF in pyspark 2.0. As in .. def squareIt(x): return x * x # register the function and define return type …. spark.sql(“”"select myUdf(adgroupid, 'extra_string_parameter') as function_result from df’) _ How can I register the function? I only see

Using Spark 2.0 inside Docker

2016-08-04 Thread mhornbech
Hi, We are currently running a setup with Spark 1.6.2 inside Docker. It requires the use of the HTTPBroadcastFactory instead of the default TorrentBroadcastFactory to avoid the use of random ports, which cannot be exposed through Docker. From the Spark 2.0 release notes I can see that the

Re: Spark SQL and number of task

2016-08-04 Thread Takeshi Yamamuro
Seems the performance difference comes from `CassandraSourceRelation`. I'm not familiar with the implementation, though; I guess the `IN` filter is pushed down into the datasource and the other is not. You'd be better off checking the performance metrics in the web UI. // maropu On Thu, Aug 4, 2016 at 8:41 PM,

How to avoid sql injection on SparkSQL?

2016-08-04 Thread Linyuxin
Hi All, I want to know how to avoid SQL injection in Spark SQL. Is there any common pattern for this? E.g. some useful tool or code segment, or do I just have to create a "wheel" for Spark SQL myself? Thanks.
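One common pattern, sketched in PySpark, is to keep untrusted input out of the SQL string entirely and pass it through the DataFrame/Column API as a literal instead; the table name people and column name name are assumptions, and spark is assumed to be an existing session:

    from pyspark.sql import functions as F

    user_input = "O'Brien; drop table users"   # untrusted value from a caller

    # Risky: string interpolation lets the value be parsed as SQL.
    # spark.sql("select * from people where name = '%s'" % user_input)

    # Safer: the value is bound as a literal and never parsed as SQL text.
    result = spark.table("people").where(F.col("name") == F.lit(user_input))
    result.show()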

Re: Spark SQL and number of task

2016-08-04 Thread Marco Colombo
OK, thanks. The 2 plans are very similar with the IN condition: +--+--+ | plan |

Are join/groupBy operations with wide Java Beans using Dataset API much slower than using RDD API?

2016-08-04 Thread dueckm
Hello, I built a prototype that uses join and groupBy operations via the Spark RDD API. Recently I migrated it to the Dataset API. Now it runs much slower than with the original RDD implementation. Did I do something wrong here? Or is this a price I have to pay for the more convenient API? Is there

Re: Spark SQL and number of task

2016-08-04 Thread Takeshi Yamamuro
Hi, Please type `sqlCtx.sql("select * ").explain` to show execution plans. Also, you can kill jobs from the web UI. // maropu On Thu, Aug 4, 2016 at 4:58 PM, Marco Colombo wrote: > Hi all, I've a question on how hive+spark are handling data. > > I've started a
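A small PySpark illustration of comparing the plans for the two predicate forms discussed in this thread; it assumes v_points is registered as a table/temp view and that spark is an existing session:

    # Extended explain prints parsed, analyzed, optimized and physical plans.
    spark.sql("select * from v_points where id in (94, 2)").explain(True)
    spark.sql("select * from v_points where id = 94 or id = 2").explain(True)
    # Compare the filter/pushed-filter section of the physical plans to see
    # which predicate actually reaches the datasource and which stays in Spark.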

Re: SPARKSQL with HiveContext My job fails

2016-08-04 Thread Mich Talebzadeh
Well the error states Exception in thread thread_name: java.lang.OutOfMemoryError: GC Overhead limit exceeded Cause: The detail message "GC overhead limit exceeded" indicates that the garbage collector is

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Nick Pentreath
Hi Ben, Perhaps with this size cardinality it is worth looking at feature hashing for your problem. Spark has the HashingTF transformer that works on a column of "sentences" (i.e. [string]). For categorical features you can hack it a little by converting your feature value into a
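A minimal sketch of that hack in PySpark: wrap each categorical value in a single-element array and let HashingTF map it into a fixed-width sparse vector. The column names and numFeatures value are assumptions.

    from pyspark.ml.feature import HashingTF
    from pyspark.sql import functions as F

    # Turn the high-cardinality string column into a one-element "sentence".
    df2 = df.withColumn("bigfeature_arr", F.array(F.col("bigfeature")))

    # Fixed output dimensionality, no fitting pass over 56m distinct values.
    htf = HashingTF(inputCol="bigfeature_arr", outputCol="bigfeature_vec",
                    numFeatures=1 << 20)
    hashed = htf.transform(df2)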

SPARKSQL with HiveContext My job fails

2016-08-04 Thread Vasu Devan
Hi Team, My Spark job fails with the error below. Could you please advise me on what the problem with my job is? Below is my error stack: 16/08/04 05:11:06 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-14] shutting down ActorSystem [sparkDriver]

pycharm and pyspark on windows

2016-08-04 Thread pseudo oduesp
Hi, what is a good configuration for pyspark and PyCharm on Windows? Thanks

Questions about ml.random forest (only one decision tree?)

2016-08-04 Thread 陈哲
Hi all, I'm trying to use Spark ML to do some prediction with random forest. By reading the example code https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaRandomForestClassifierExample.java , I can only find out that it's similar to

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Ben Teeuwen
I raised driver memory to 30G and maxresultsize to 25G, this time in pyspark. Code run: cat_int = ['bigfeature'] stagesIndex = [] stagesOhe = [] for c in cat_int: stagesIndex.append(StringIndexer(inputCol=c, outputCol="{}Index".format(c))) stagesOhe.append(OneHotEncoder(dropLast= False,

Spark SQL and number of task

2016-08-04 Thread Marco Colombo
Hi all, I have a question on how Hive+Spark handle data. I've started a new HiveContext and I'm extracting data from Cassandra. I've configured spark.sql.shuffle.partitions=10. Now, I have the following query: select d.id, avg(d.avg) from v_points d where id=90 group by id; I see that 10 tasks are

Re: how to debug spark app?

2016-08-04 Thread Ben Teeuwen
Related question: what are good profiling tools other than watching along the application master with the running code? Are there things that can be logged during the run? If I have say 2 ways of accomplishing the same thing, and I want to learn about the time/memory/general resource blocking

Re: Stop Spark Streaming Jobs

2016-08-04 Thread Sandeep Nemuri
Also set spark.streaming.stopGracefullyOnShutdown to true. If true, Spark shuts down the StreamingContext gracefully on JVM shutdown rather than immediately. http://spark.apache.org/docs/latest/configuration.html#spark-streaming On Thu, Aug 4, 2016 at 12:31 PM, Sandeep Nemuri

Re: Stop Spark Streaming Jobs

2016-08-04 Thread Sandeep Nemuri
StreamingContext.stop(...) if using Scala, JavaStreamingContext.stop(...) if using Java. On Wed, Aug 3, 2016 at 9:14 PM, Tony Lane wrote: > SparkSession exposes stop() method > > On Wed, Aug 3, 2016 at 8:53 AM, Pradeep wrote: >> Thanks Park. I
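In PySpark the equivalent stop call looks like this; a sketch where ssc is assumed to be an already-running pyspark.streaming.StreamingContext:

    # Stop the streaming context gracefully, letting in-flight batches finish,
    # and keep the underlying SparkContext alive if it is still needed.
    ssc.stop(stopSparkContext=False, stopGraceFully=True)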

Explanation regarding Spark Streaming

2016-08-04 Thread Saurav Sinha
Hi, I have a query. Q1. What will happen if a Spark Streaming job has a batch duration of 60 sec and the processing time of the complete pipeline is greater than 60 sec? -- Thanks and Regards, Saurav Sinha Contact: 9742879062

How to connect Power BI to Apache Spark on local machine?

2016-08-04 Thread Devi P.V
Hi all, I am a newbie to Power BI. What configurations are needed to connect Power BI to Spark on my local machine? I found some documents that mention Spark over Azure's HDInsight, but didn't find any reference materials for connecting to Spark on a remote machine. Is it possible? The following is the

Re: how to run local[k] threads on a single core

2016-08-04 Thread Sun Rui
I don’t think it is possible, as Spark does not support thread-to-CPU affinity. > On Aug 4, 2016, at 14:27, sujeet jog wrote: > > Is there a way we can run multiple tasks concurrently on a single core in > local mode. > > for ex :- i have 5 partition ~ 5 tasks, and only a

how to run local[k] threads on a single core

2016-08-04 Thread sujeet jog
Is there a way we can run multiple tasks concurrently on a single core in local mode? For example: I have 5 partitions ~ 5 tasks and only a single core; I want these tasks to run concurrently, and to specify that they use/run on a single core. The machine itself is, say, 4 cores, but I want to utilize

Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-04 Thread Richard Siebeling
Hi, Spark 2.0 with the MapR Hadoop libraries was successfully built using the following command: ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0-mapr-1602 -DskipTests clean package However, when I then try to build a runnable distribution using the following command ./dev/make-distribution.sh