Re: Spark 2.x OFF_HEAP persistence

2017-01-04 Thread Vin J
Thanks for the reply Gene. Looks like this means that, with Spark 2.x, one has to change from rdd.persist(StorageLevel.OFF_HEAP) to rdd.saveAsTextFile(alluxioPath) / rdd.saveAsObjectFile(alluxioPath) for guarantees like a persisted RDD surviving a Spark JVM crash etc., as well as the other benefits you
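A minimal sketch of the change being described, assuming an Alluxio master at a placeholder host/port and an arbitrary path; the record type MyRecord is hypothetical:

    // Spark 1.x style: rdd.persist(StorageLevel.OFF_HEAP) stored blocks in Tachyon/Alluxio.
    // The alternative discussed here: write the RDD out to Alluxio explicitly.
    rdd.saveAsObjectFile("alluxio://alluxio-master:19998/checkpoints/my-rdd")

    // Later, even from a new Spark application (so it survives a crash of the first JVM):
    val restored = sc.objectFile[MyRecord]("alluxio://alluxio-master:19998/checkpoints/my-rdd")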

Re: Spark GraphFrame ConnectedComponents

2017-01-04 Thread Ankur Srivastava
This is the exact trace from the driver logs Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645) at

Re: L1 regularized Logistic regression ?

2017-01-04 Thread Yang
ah, found it, it's https://www.google.com/search?q=OWLQN thanks! On Wed, Jan 4, 2017 at 7:34 PM, J G wrote: > I haven't run this, but there is an elasticnetparam for Logistic > Regression here: https://spark.apache.org/docs/2.0.2/ml- >

Re: Spark GraphFrame ConnectedComponents

2017-01-04 Thread Ankur Srivastava
Hi I am rerunning the pipeline to generate the exact trace, I have below part of trace from last run: Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: s3n://, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642) at

Re: L1 regularized Logistic regression ?

2017-01-04 Thread J G
I haven't run this, but there is an elasticNetParam for Logistic Regression here: https://spark.apache.org/docs/2.0.2/ml-classification-regression.html#logistic-regression You'd set elasticNetParam = 1 for Lasso On Wed, Jan 4, 2017 at 7:13 PM, Yang wrote: > does mllib
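A minimal sketch of what setting that parameter looks like in spark.ml; the training DataFrame trainingDF and the regParam value are assumptions:

    import org.apache.spark.ml.classification.LogisticRegression

    val lr = new LogisticRegression()
      .setElasticNetParam(1.0) // 1.0 = pure L1 (Lasso); 0.0 = pure L2 (ridge)
      .setRegParam(0.01)       // overall regularization strength (illustrative value)

    val model = lr.fit(trainingDF) // trainingDF: DataFrame with "label" and "features" columns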

Re: Spark GraphFrame ConnectedComponents

2017-01-04 Thread Felix Cheung
Do you have more of the exception stack? From: Ankur Srivastava Sent: Wednesday, January 4, 2017 4:40:02 PM To: user@spark.apache.org Subject: Spark GraphFrame ConnectedComponents Hi, I am trying to use the ConnectedComponent

Spark GraphFrame ConnectedComponents

2017-01-04 Thread Ankur Srivastava
Hi, I am trying to use the ConnectedComponents algorithm of GraphFrames, but by default it needs a checkpoint directory. As I am running my Spark cluster with S3 as the DFS and do not have access to an HDFS file system, I tried using an S3 directory as the checkpoint directory, but I run into the below
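For context, a minimal sketch of the setup being attempted, with a placeholder bucket name; verticesDF and edgesDF are assumed to exist. The "Wrong FS ... expected: file:///" error in the replies suggests the checkpoint path is being resolved against the default (local) filesystem rather than S3.

    import org.graphframes.GraphFrame

    // Placeholder bucket/path; verticesDF and edgesDF are assumed to already exist.
    spark.sparkContext.setCheckpointDir("s3n://my-bucket/graphframes-checkpoints")

    val g = GraphFrame(verticesDF, edgesDF)
    val components = g.connectedComponents.run() // requires the checkpoint directory set above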

L1 regularized Logistic regression ?

2017-01-04 Thread Yang
does mllib support this? I do see a Lasso impl here https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/Lasso.scala if it supports LR, could you please show me a link? what algorithm does it use? thanks

RE: Best way to process lookup ETL with Dataframes

2017-01-04 Thread Sesterhenn, Mike
Thanks a lot Nicholas. RE: Upgrading, I was afraid someone would suggest that. ☺ Yes we have an upgrade planned, but due to politics, we have to finish this first round of ETL before we can do the upgrade. I can’t confirm for sure that this issue would be fixed in Spark >= 1.6 without doing

Re: Spark Aggregator for array of doubles

2017-01-04 Thread Anton Okolnychyi
Hi, take a look at this pull request that is not merged yet: https://github.com/apache/spark/pull/16329 . It contains examples in Java and Scala that can be helpful. Best regards, Anton Okolnychyi On Jan 4, 2017 23:23, "Anil Langote" wrote: > Hi All, > > I have been

Spark Aggregator for array of doubles

2017-01-04 Thread Anil Langote
Hi All, I have been working on a use case where I have a DF which has 25 columns: 24 columns are of type string and the last column is an array of doubles. For a given set of columns I have to apply group by and add the arrays of doubles. I have implemented a UDAF which works fine but it's expensive in
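As an alternative to a UDAF, a minimal sketch of the element-wise sum with a typed groupByKey/mapGroups, assuming a DataFrame df with a string key column "k1" and an Array[Double] column "values" where all arrays have the same length (the column names are hypothetical):

    import spark.implicits._

    val summed = df
      .select($"k1", $"values")
      .as[(String, Seq[Double])]
      .groupByKey { case (key, _) => key }
      .mapGroups { case (key, rows) =>
        // element-wise sum of all arrays in the group
        val total = rows.map(_._2).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
        (key, total)
      }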

IBM Fluid query versus Spark

2017-01-04 Thread Mich Talebzadeh
Hi, Has anyone had any experience of using IBM Fluid query and comparing it with Spark with its MPP and in-memory capabilities? Thanks, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

[Spark GraphX] Graph Aggregation

2017-01-04 Thread Will Swank
Hi All - I'm new to Spark and GraphX and I'm trying to perform a simple sum operation for a graph. I have posted this question to StackOverflow and also on the gitter channel to no avail. I'm wondering if someone can help me out. The StackOverflow link is here:
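A minimal sketch of one common way to sum over a graph with aggregateMessages, assuming an existing Graph[Double, Double] named graph whose vertex attribute is the value to be summed:

    import org.apache.spark.graphx._

    // For each vertex, sum the attributes of its in-neighbours.
    val neighbourSums: VertexRDD[Double] = graph.aggregateMessages[Double](
      triplet => triplet.sendToDst(triplet.srcAttr), // send the source vertex's value along each edge
      (a, b) => a + b                                // combine messages arriving at the same vertex
    )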

Re: Spark 2.x OFF_HEAP persistence

2017-01-04 Thread Gene Pang
Hi Vin, From Spark 2.x, OFF_HEAP was changed to no longer directly interface with an external block store. The previous tight dependency was restrictive and reduced flexibility. It looks like the new version uses the executor's off-heap memory to allocate direct byte buffers, and does not

Re: Approach: Incremental data load from HBASE

2017-01-04 Thread ayan guha
Hi Chetan What do you mean by incremental load from HBase? There is a timestamp marker for each cell, but not at Row level. On Wed, Jan 4, 2017 at 10:37 PM, Chetan Khatri wrote: > Ted Yu, > > You understood wrong, i said Incremental load from HBase to Hive, >

Re: Difference in R and Spark Output

2017-01-04 Thread Satya Varaprasad Allumallu
Looks like the default algorithm used by R's kmeans function is Hartigan-Wong, whereas Spark seems to be using Lloyd's algorithm. Can you rerun your kmeans R code using algorithm = "Lloyd" and see if the results match? On Tue, Jan 3, 2017 at 12:18 AM, Saroj C wrote: > Thanks

Re: Dynamic Allocation not respecting spark.executor.cores

2017-01-04 Thread Nirav Patel
If this is not expected behavior then it should be logged as an issue. On Tue, Jan 3, 2017 at 2:51 PM, Nirav Patel wrote: > When enabling dynamic scheduling I see that all executors are using only 1 > core even if I specify "spark.executor.cores" to 6. If dynamic
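For reference, a minimal sketch of the configuration being discussed; the property names are standard Spark settings and the values are illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true") // external shuffle service is required for dynamic allocation
      .set("spark.executor.cores", "6")             // the setting the thread reports being ignored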

Converting an InternalRow to a Row

2017-01-04 Thread Andy Dang
Hi all, (cc-ing dev since I've hit some developer API corner) What's the best way to convert an InternalRow to a Row if I've got an InternalRow and the corresponding schema? Code snippet: @Test public void foo() throws Exception { Row row = RowFactory.create(1);
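One way to do this in the Spark 2.x timeframe is via a RowEncoder built from the schema; this is a minimal sketch using internal (catalyst) APIs that are not a stable public interface, so treat it as an assumption rather than a recommended approach:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(StructField("id", IntegerType, nullable = false)))
    val internal = InternalRow(1)                                    // the InternalRow to convert
    val row: Row = RowEncoder(schema).resolveAndBind().fromRow(internal)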

Spark 2.x OFF_HEAP persistence

2017-01-04 Thread Vin J
Until Spark 1.6, I see there were specific properties to configure, such as the external block store master URL (spark.externalBlockStore.url) etc., to use the OFF_HEAP storage level, which made it clear that an external Tachyon type of block store was required/used for OFF_HEAP storage. Can someone

Re: (send this email to subscribe)

2017-01-04 Thread Dinko Srkoč
You can run a Spark app on Dataproc, which is Google's managed Spark and Hadoop service: https://cloud.google.com/dataproc/docs/ basically, you: * assemble a jar * create a cluster * submit a job to that cluster (with the jar) * delete the cluster when the job is done Before all that, one has to

Re: top-k function for Window

2017-01-04 Thread Georg Heiler
What about https://github.com/myui/hivemall/wiki/Efficient-Top-k-computation-on-Apache-Hive-using-Hivemall-UDTF Koert Kuipers wrote on Wed, Jan 4, 2017 at 16:11: > i assumed topk of frequencies in one pass. if its topk by known > sorting/ordering then use priority queue

Re: spark sql in Cloudera package

2017-01-04 Thread Sean Owen
(You can post this on the CDH lists BTW as it's more about that distribution.) The whole thrift server isn't supported / enabled in CDH, so I think that's why the script isn't turned on either. I don't think it's as much about using Impala as not wanting to do all the grunt work to make it

spark sql in Cloudera package

2017-01-04 Thread Mich Talebzadeh
Sounds like Cloudera do not supply the shell for spark-sql but only spark-shell. Is that correct? I appreciate that one can use spark-shell; however, it sounds like spark-sql is excluded in favour of Impala? cheers Dr Mich Talebzadeh LinkedIn *

Re: top-k function for Window

2017-01-04 Thread Koert Kuipers
i assumed top-k of frequencies in one pass. if it's top-k by known sorting/ordering then use a priority queue aggregator instead of spacesaver. On Tue, Jan 3, 2017 at 3:11 PM, Koert Kuipers wrote: > i dont know anything about windowing or about not using developer apis... > > but
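If the ordering is known, a minimal sketch of the windowed row_number approach the thread subject alludes to (as opposed to the one-pass frequency/priority-queue aggregators discussed above); df, the column names and k = 3 are assumptions:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Top 3 rows per group by descending score.
    val w = Window.partitionBy("group").orderBy(desc("score"))
    val topK = df
      .withColumn("rn", row_number().over(w))
      .filter(col("rn") <= 3)
      .drop("rn")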

Re: Dependency Injection and Microservice development with Spark

2017-01-04 Thread darren
We've been able to use the ipopo dependency injection framework in our pyspark system and deploy .egg pyspark apps that resolve and wire up all the components (like a kernel architecture, also similar to Spring) during an initial bootstrap sequence; then invoke those components across spark. Just

Re: Issue with SparkR setup on RStudio

2017-01-04 Thread Md. Rezaul Karim
Cheung, The problem has been solved after switching from a Windows to a Linux environment. Thanks. Regards, Md. Rezaul Karim, BSc, MSc, PhD Researcher, INSIGHT Centre for Data Analytics, National University of Ireland, Galway, IDA Business Park, Dangan, Galway,

Initial job has not accepted any resources

2017-01-04 Thread Igor Berman
Hi All, need your advice: we see in some very rare cases the following error in the log: "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources" and in the Spark UI there are idle workers and the application is in the WAITING state in json

Re: Dependency Injection and Microservice development with Spark

2017-01-04 Thread Jiří Syrový
Hi, another nice approach is to instead use the Reader monad and some framework that supports this approach (e.g. Grafter - https://github.com/zalando/grafter). It's lightweight and helps a bit with dependency issues. 2016-12-28 22:55 GMT+01:00 Lars Albertsson : > Do you

Re: Approach: Incremental data load from HBASE

2017-01-04 Thread Chetan Khatri
Ted Yu, You understood wrong, I said incremental load from HBase to Hive; individually you can say incremental import from HBase. On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu wrote: > Incremental load traditionally means generating hfiles and > using

Re: Dependency Injection and Microservice development with Spark

2017-01-04 Thread Chetan Khatri
Lars, Thank you, I want to use DI for configuring all the properties (wiring) for the below architectural approach: Oracle -> Kafka Batch (Event Queuing) -> Spark Jobs (Incremental load from HBase -> Hive with Transformation) -> Spark Transformation -> PostgreSQL Thanks. On Thu, Dec 29, 2016 at

Re: Error: PartitioningCollection requires all of its partitionings have the same numPartitions.

2017-01-04 Thread mhornbech
I am also experiencing this. Do you have a JIRA on it?

Re: Apache Hive with Spark Configuration

2017-01-04 Thread Chetan Khatri
Ryan, I agree that Hive 1.2.1 works reliably with Spark 2.x, but I went with the current stable version of Hive, which is 2.0.1, and I am working with that. Seems good, but I want to make sure which version of Hive is more reliable with Spark 2.x, and I think @Ryan you replied the same, which

Re: [External] Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-04 Thread Ben Teeuwen
Another option: https://github.com/mysql-time-machine/replicator From the readme: "Replicates data changes from MySQL binlog to HBase or Kafka. In case of HBase, preserves the previous data versions. HBase storage is intended for auditing purposes of historical data. In addition, special