Cannot Import Package (spark-csv)

2015-08-02 Thread billchambers
I am trying to import the spark-csv package while using the Scala Spark shell (Spark 1.4.1, Scala 2.11). I am starting the shell with: bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0 --jars ../sjars/spark-csv_2.11-1.1.0.jar --master local I then try and run and get the

Re: Cannot Import Package (spark-csv)

2015-08-02 Thread Ted Yu
The command you ran and the error you got were not visible. Mind sending them again? Cheers On Sun, Aug 2, 2015 at 8:33 PM, billchambers wchamb...@ischool.berkeley.edu wrote: I am trying to import the spark-csv package while using the Scala Spark shell. Spark 1.4.1, Scala 2.11 I am

Re: Cannot Import Package (spark-csv)

2015-08-02 Thread billchambers
Sure, the commands are: scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv") and I get the following error: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv at

Re: Cannot Import Package (spark-csv)

2015-08-02 Thread Ted Yu
I tried the following command on master branch: bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 --jars ../spark-csv_2.10-1.0.3.jar --master local I didn't reproduce the error with your command. FYI On Sun, Aug 2, 2015 at 8:57 PM, Bill Chambers wchamb...@ischool.berkeley.edu
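For reference, a minimal spark-shell session along the lines discussed in this thread (a sketch only: it assumes the spark-csv package resolves correctly and that a cars.csv file exists in the working directory; the option values must be quoted strings):

```scala
// Shell started as, e.g.:
//   bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 --master local
// Load the CSV through the spark-csv data source; "header" tells it the
// first line carries column names.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("cars.csv")

df.printSchema()
df.show(5)
```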

Checkpoint file not found

2015-08-02 Thread Anand Nalya
Hi, I'm writing a Streaming application in Spark 1.3. After running for some time, I'm getting the following exception. I'm sure that no other process is modifying the HDFS file. Any idea what might be the cause of this? 15/08/02 21:24:13 ERROR scheduler.DAGSchedulerEventProcessLoop:

Re: Spark Number of Partitions Recommendations

2015-08-02 Thread Понькин Алексей
Yes, I forgot to mention that I chose a prime number as the modulo for the hash function, because my keys are usually strings and Spark computes the particular partition from the key hash (see HashPartitioner.scala). So, to avoid a large number of collisions (when many keys land in only a few partitions) it is common to use
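A minimal sketch of the point being made (the function name and partition count are illustrative, not from the thread): partition a string-keyed pair RDD with a prime number of partitions so the hash modulo spreads keys more evenly.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// HashPartitioner assigns a key to partition key.hashCode % numPartitions
// (made non-negative); with string keys, a prime partition count such as 127
// tends to produce fewer collisions than a round number like 100 or 128.
def partitionByPrime(pairs: RDD[(String, Long)]): RDD[(String, Long)] =
  pairs.partitionBy(new HashPartitioner(127))
```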

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Barak Gitsis
Hi, reducing spark.storage.memoryFraction did the trick for me. The heap doesn't get filled because it is reserved. My reasoning is: I give the executor all the memory I can give it, so that makes it a boundary. From there I try to make the best use of memory I can. storage.memoryFraction is in a sense
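A hedged illustration of that tuning (the values are made up, not taken from this thread): fix the executor heap as the boundary, then shrink the fraction reserved for cached blocks so more of the heap is available for execution.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative Spark 1.x settings: spark.storage.memoryFraction defaults to
// 0.6; lowering it leaves less heap reserved for the block cache.
val conf = new SparkConf()
  .setAppName("memory-fraction-sketch")
  .set("spark.executor.memory", "8g")          // the heap boundary per executor
  .set("spark.storage.memoryFraction", "0.2")  // down from the 0.6 default
val sc = new SparkContext(conf)
```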

Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Akhil Das
Currently RDDs are not encrypted. I think you can go ahead and open a JIRA to add this feature, and maybe it could be added in a future release. Thanks Best Regards On Fri, Jul 31, 2015 at 1:47 PM, Matthew O'Reilly moreill...@qub.ac.uk wrote: Hi, I am currently working on the latest version of

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
Hi Barak, It is OK with Spark 1.3.0; the problem is with Spark 1.4.1. I don't think spark.storage.memoryFraction will make any difference, because it is still heap memory. ------ Original Message ------ From: Barak Gitsis <bar...@similarweb.com> Sent:

Re: Does Spark Streaming need to list all the files in a directory?

2015-08-02 Thread Akhil Das
I guess it goes through those 500k files https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L193 for the first time and then uses a filter from the next time. Thanks Best Regards On Fri, Jul 31, 2015 at 4:39 AM, Tathagata Das

Re: unsubscribe

2015-08-02 Thread Akhil Das
LOL Brandon! @ziqiu See http://spark.apache.org/community.html You need to send an email to user-unsubscr...@spark.apache.org Thanks Best Regards On Fri, Jul 31, 2015 at 2:06 AM, Brandon White bwwintheho...@gmail.com wrote: https://www.youtube.com/watch?v=JncgoPKklVE On Thu, Jul 30, 2015

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Barak Gitsis
Spark uses a lot more than heap memory; it is the expected behavior. In 1.4, off-heap memory usage is supposed to grow in comparison to 1.3. Better to use as little memory as you can for the heap, and since you are not utilizing it already, it is safe for you to reduce it. memoryFraction helps you

spark no output

2015-08-02 Thread Pa Rö
hi community, I have run my k-means Spark application on 1 million data points. The program works, but no output is generated in HDFS. When it runs on 10,000 points, output is written. Maybe someone has an idea? best regards, paul

Re: spark no output

2015-08-02 Thread Ted Yu
Can you provide some more detail: the release of Spark you're using, whether you were running in standalone or YARN cluster mode, and have you checked the driver log? Cheers On Sun, Aug 2, 2015 at 7:04 AM, Pa Rö paul.roewer1...@googlemail.com wrote: hi community, i have run my k-means spark application on

Re: spark no output

2015-08-02 Thread Connor Zanin
I agree with Ted. Could you please post the log file? On Aug 2, 2015 10:13 AM, Ted Yu yuzhih...@gmail.com wrote: Can you provide some more detail: release of Spark you're using were you running in standalone or YARN cluster mode have you checked driver log? Cheers On Sun, Aug 2, 2015 at

Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Jörn Franke
I think your use case can already be implemented with HDFS encryption and/or SealedObject, if you are looking for something like Altibase. If you create a JIRA you may want to set the bar a little bit higher and propose something like MIT CryptDB: https://css.csail.mit.edu/cryptdb/ On Fri. 31 Jul. 2015 at 10:17,

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
spark.storage.memoryFraction is in heap memory, but my situation is that the memory used is more than the heap memory! Does anyone else use Spark 1.4.1 in production? ------ Original Message ------ From: Ted Yu <yuzhih...@gmail.com> Sent: Aug 2, 2015 (Sun) 5:45 PM To:

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Igor Berman
What kind of cluster? How many cores on each worker? Is there a config for the HTTP Solr client? I remember the standard HttpClient has a limit per route/host. On Aug 2, 2015 8:17 PM, Sujit Pal sujitatgt...@gmail.com wrote: No one has any ideas? Is there some more information I should provide? I am

Re: TCP/IP speedup

2015-08-02 Thread Michael Segel
This may seem like a silly question… but in following Mark's link, the presentation talks about the TPC-DS benchmark. Here's my question… what benchmark results? If you go over to the TPC.org (http://tpc.org/) website they have no TPC-DS benchmarks listed (either audited or unaudited). So

how to ignore MatchError then processing a large json file in spark-sql

2015-08-02 Thread fuellee lee
I'm trying to process a bunch of large JSON log files with Spark, but it fails every time with `scala.MatchError`, whether I give it a schema or not. I just want to skip lines that do not match the schema, but I can't find how in the Spark docs. I know I could write a JSON parser and map it over the JSON file RDD
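One common workaround (a sketch under stated assumptions, not something from the thread): read the files as plain text and keep only the lines that actually parse, instead of relying on the schema-based JSON reader. The HDFS path is hypothetical, and the snippet assumes the Scala standard-library JSON parser is available (bundled with Scala 2.10; a separate scala-parser-combinators module in 2.11).

```scala
import scala.util.Try
import scala.util.parsing.json.JSON

// Read the log files as text and silently drop records that are not valid
// JSON objects, so a single malformed line cannot fail the whole job.
val raw = sc.textFile("hdfs:///logs/*.json")
val parsed = raw
  .flatMap(line => Try(JSON.parseFull(line)).toOption.flatten)
  .collect { case m: Map[_, _] => m.asInstanceOf[Map[String, Any]] }

println(s"kept ${parsed.count()} well-formed records")
```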

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Sujit Pal
No one has any ideas? Is there some more information I should provide? I am looking for ways to increase the parallelism among workers. Currently I just see the number of simultaneous connections to Solr equal to the number of workers. My number of partitions is (2.5x) larger than the number of workers,

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Igor Berman
So how many cores do you configure per node? Do you have something like --total-executor-cores or maybe a --num-executors config (I'm not sure what kind of cluster the Databricks platform provides; if it's standalone then the first option should be used)? If you have 4 cores in total, then even though you have
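For a standalone cluster, the knobs being referred to can also be set programmatically; a sketch with illustrative values only (4 workers with 4 cores each, as described later in the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values. On a standalone cluster, spark.cores.max caps the total
// cores the application may claim across all workers, and default.parallelism
// controls how many partitions operations like reduceByKey produce by default.
val conf = new SparkConf()
  .setAppName("parallelism-sketch")
  .set("spark.cores.max", "16")            // 4 workers x 4 cores each
  .set("spark.default.parallelism", "40")  // roughly 2.5x the core count
val sc = new SparkContext(conf)
```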

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Sujit Pal
Hi Igor, The cluster is a Databricks Spark cluster. It consists of 1 master + 4 workers, each worker has 60GB RAM and 4 CPUs. The original mail has some more details (also the reference to the HttpSolrClient in there should be HttpSolrServer, sorry about that, mistake while writing the email).

RE: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Silvio Fiorito
Can you share the transformations up to the foreachPartition? From: Sujit Pal sujitatgt...@gmail.com Sent: 8/2/2015 4:42 PM To: Igor Berman igor.ber...@gmail.com Cc: user user@spark.apache.org Subject: Re: How to increase parallelism of a Spark

Extremely poor predictive performance with RF in mllib

2015-08-02 Thread pkphlam
Hi, This might be a long shot, but has anybody run into very poor predictive performance using RandomForest with MLlib? Here is what I'm doing: - Spark 1.4.1 with PySpark - Python 3.4.2 - ~30,000 Tweets of text - 12289 1s and 15956 0s - Whitespace tokenization and then hashing trick for feature
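A rough Scala analogue of the pipeline described (hashing trick into a fixed feature space, then a RandomForest classifier); the file path, tokenization, and parameter values are illustrative, not the poster's actual code:

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

// Hypothetical input: one tweet per line as "label<TAB>text", label 0 or 1.
val data = sc.textFile("tweets.tsv").map { line =>
  val Array(label, text) = line.split("\t", 2)
  (label.toDouble, text)
}

// Hashing trick: whitespace tokens mapped into a 2^18-dimensional space.
val hashingTF = new HashingTF(numFeatures = 1 << 18)
val points = data.map { case (label, text) =>
  LabeledPoint(label, hashingTF.transform(text.toLowerCase.split("\\s+")))
}

val Array(train, test) = points.randomSplit(Array(0.8, 0.2), seed = 42L)
val model = RandomForest.trainClassifier(
  train,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 100,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 10,
  maxBins = 32)

// Simple holdout accuracy check.
val accuracy = test.map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0).mean()
```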

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Abhishek R. Singh
I don't know if (your assertion/expectation that) workers will process things (multiple partitions) in parallel is really valid. Or if having more partitions than workers will necessarily help (unless you are memory bound - so partitioning is essentially helping your work size rather than

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
"spark uses a lot more than heap memory, it is the expected behavior" - this didn't exist in Spark 1.3.x. What does "a lot more than" mean? It means that I lose control of it! I request 31g, but it still grows to 55g and continues to grow!!! That is the point! I have tried setting memoryFraction to
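If this job runs on YARN, one setting worth checking alongside the heap size is the executor memory overhead, since the container limit YARN enforces is roughly heap plus overhead; a hedged sketch with illustrative values (the thread does not say which cluster manager is in use):

```scala
import org.apache.spark.SparkConf

// Illustrative only, and only relevant on YARN: the overhead (in MB for
// Spark 1.x) accounts for off-heap usage such as network buffers, so growth
// beyond the heap counts against it rather than spark.storage.memoryFraction.
val conf = new SparkConf()
  .set("spark.executor.memory", "24g")
  .set("spark.yarn.executor.memoryOverhead", "4096")
```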

Re: spark cluster setup

2015-08-02 Thread Sonal Goyal
What do the master logs show? Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co Check out

Re: TCP/IP speedup

2015-08-02 Thread Steve Loughran
On 1 Aug 2015, at 18:26, Ruslan Dautkhanov dautkha...@gmail.com wrote: If your network is bandwidth-bound, you'll see that setting jumbo frames (MTU 9000) may increase bandwidth by up to ~20%.

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Steve Loughran
On 2 Aug 2015, at 13:42, Sujit Pal sujitatgt...@gmail.com wrote: There is no additional configuration on the external Solr host from my code; I am using the default HttpClient provided by HttpSolrServer. According to the Javadocs, you can pass in an HttpClient
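A sketch of what passing in a configured HttpClient could look like (SolrJ 4.x and Apache HttpClient 4.x class names; the Solr URL and pool sizes are illustrative), raising the per-route connection limit Igor mentioned earlier in the thread:

```scala
import org.apache.http.impl.client.HttpClients
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager
import org.apache.solr.client.solrj.impl.HttpSolrServer

// The stock HttpClient allows only a handful of connections per route, which
// can serialize all requests going to a single Solr host.
val connManager = new PoolingHttpClientConnectionManager()
connManager.setMaxTotal(64)
connManager.setDefaultMaxPerRoute(32)

val httpClient = HttpClients.custom()
  .setConnectionManager(connManager)
  .build()

// Hypothetical Solr endpoint; the client can then be shared per partition.
val solr = new HttpSolrServer("http://solr-host:8983/solr/collection1", httpClient)
```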