Cannot Import Package (spark-csv)
I am trying to import the spark-csv package while using the Scala Spark shell (Spark 1.4.1, Scala 2.11). I am starting the shell with:

bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0 --jars ../sjars/spark-csv_2.11-1.1.0.jar --master local

I then try to run it and get the following error: What am I doing wrong?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-Import-Package-spark-csv-tp24109.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Cannot Import Package (spark-csv)
The command you ran and the error you got were not visible. Mind sending them again?

Cheers
Re: Cannot Import Package (spark-csv)
Sure, the commands are:

scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")

and I get the following error:

java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:220)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:233)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
... 49 elided

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-Import-Package-spark-csv-tp24109p24110.html
Re: Cannot Import Package (spark-csv)
I tried the following command on the master branch:

bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 --jars ../spark-csv_2.10-1.0.3.jar --master local

I didn't reproduce the error with your command. FYI
Checkpoint file not found
Hi, I'm writing a Streaming application in Spark 1.3. After running for some time, I get the following exception. I'm sure that no other process is modifying the HDFS file. Any idea what might be the cause of this?

15/08/02 21:24:13 ERROR scheduler.DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext
java.io.FileNotFoundException: File does not exist: hdfs://node16:8020/user/anandnalya/tiered-original/e6794c2c-1c9f-414a-ae7e-e58a8f874661/rdd-5112/part-0
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1132)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1124)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1124)
at org.apache.spark.rdd.CheckpointRDD.getPreferredLocations(CheckpointRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:230)
at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:230)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:230)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1324)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1334)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1333)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1331)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1331)
(the getPreferredLocsInternal/apply/foreach frames above repeat recursively; trace truncated)
Re: Spark Number of Partitions Recommendations
Yes, I forgot to mention: I chose a prime number as the modulus for the hash function because my keys are usually strings, and Spark computes a key's partition from its hash (see HashPartitioner.scala). So, to avoid a large number of collisions (many keys landing in a few partitions), it is common to use a prime number as the modulus. But of course this makes sense only for String keys, because of the hash function; if you have a different hash function for keys of a different type, you can use any other modulus instead of a prime. I like this discussion of the topic: http://stackoverflow.com/questions/1145217/why-should-hash-functions-use-a-prime-number-modulus

02.08.2015, 00:14, Ruslan Dautkhanov <dautkha...@gmail.com>:

You should also take into account the amount of memory that you plan to use. It's advised not to give too much memory to each executor, otherwise GC overhead will go up. Btw, why prime numbers?

-- Ruslan Dautkhanov

On Wed, Jul 29, 2015 at 3:31 AM, ponkin <alexey.pon...@ya.ru> wrote:

Hi Rahul, Where did you see such a recommendation? I personally define partitions with the following formula:

partitions = nextPrimeNumberAbove(K * (--num-executors * --executor-cores))

where nextPrimeNumberAbove(x) is the smallest prime number greater than x, and K is a multiplier: start with 1 and increase it until join performance starts to degrade.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Number-of-Partitions-Recommendations-tp24022p24059.html
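The hash/modulo mechanics discussed above can be sketched concretely. The snippet below is an illustrative Python stand-in, not Spark's actual HashPartitioner (which calls the JVM's hashCode and Utils.nonNegativeMod); the string hash here only mimics Java's String.hashCode algorithm:

```python
# Illustrative sketch of how a key is mapped to a partition, and why the
# choice of modulus interacts with the hash function. Not Spark's real
# HashPartitioner; the hash below mimics Java's String.hashCode.

def non_negative_mod(x: int, m: int) -> int:
    # Mimics Spark's Utils.nonNegativeMod; in Java, % can return a
    # negative value, so the result is shifted back into [0, m).
    r = x % m
    return r + m if r < 0 else r

def java_string_hash(key: str) -> int:
    # h = 31*h + c for each character, as in java.lang.String.hashCode
    h = 0
    for c in key:
        h = 31 * h + ord(c)
    return h

def partition_for(key: str, num_partitions: int) -> int:
    return non_negative_mod(java_string_hash(key), num_partitions)

def spread(keys, num_partitions):
    """How many distinct partitions a set of keys actually lands in."""
    return len({partition_for(k, num_partitions) for k in keys})

keys = [f"user_{i}" for i in range(1000)]
# With string keys, a prime modulus reduces the chance that structure in
# the hashes (shared factors, low-bit patterns) clusters many keys into
# a few partitions.
print(spread(keys, 31), spread(keys, 32))
```

The same idea does not transfer automatically to other key types, which is the poster's caveat: the benefit of a prime modulus depends entirely on the hash function in use.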
Re: About memory leak in spark 1.4.1
Hi, reducing spark.storage.memoryFraction did the trick for me. The heap doesn't get filled because that fraction is reserved. My reasoning is: I give the executor all the memory I can give it, so that sets a boundary; from there I try to make the best use of memory I can. storage.memoryFraction is, in a sense, user data space; the rest can be used by the system. If you don't have so much data that you MUST store it in memory for performance, better to give Spark more space. I ended up setting it to 0.3. All that said, this is on Spark 1.3 on a cluster. Hope that helps.

On Sat, Aug 1, 2015 at 5:43 PM Sea <261810...@qq.com> wrote:

Hi all, I upgraded Spark to 1.4.1 and many applications failed. I find that the heap memory is not full, but the CoarseGrainedExecutorBackend process takes more memory than I expect, and it increases as time goes on, finally exceeding the server's limit, and the worker dies. Can anyone help?

Mode: standalone
spark.executor.memory 50g
25583 xiaoju 20 0 75.5g 55g 28m S 1729.3 88.1 2172:52 java

55g is more than the 50g I requested.

-Barak
Re: Encryption on RDDs or in-memory/cache on Apache Spark
Currently RDDs are not encrypted. I think you can go ahead and open a JIRA to add this feature; maybe it could be added in a future release.

Thanks
Best Regards

On Fri, Jul 31, 2015 at 1:47 PM, Matthew O'Reilly <moreill...@qub.ac.uk> wrote:

Hi, I am currently working with the latest version of Apache Spark (1.4.1), pre-built package for Hadoop 2.6+. Is there any feature in Spark/Hadoop to encrypt RDDs or the in-memory cache (something similar is Altibase's HDB: http://altibase.com/in-memory-database-computing-solutions/security/) when running applications in Spark? Or is there an external library/framework which could be used to encrypt RDDs or the in-memory cache in Spark? I discovered it is possible to encrypt the data and then encapsulate it in an RDD. However, I feel this hurts Spark's fast data processing, since encrypting the data and then encapsulating it in an RDD is a two-step process; encryption and storing the data should happen in parallel. Any help would be appreciated. Many thanks, Matthew
Re: About memory leak in spark 1.4.1
Hi Barak, It is OK with Spark 1.3.0; the problem is with Spark 1.4.1. I don't think spark.storage.memoryFraction will make any difference, because that is still heap memory.

------------------ Original message ------------------
From: Barak Gitsis <bar...@similarweb.com>
Sent: Sunday, Aug 2, 2015, 4:11 PM
To: Sea <261810...@qq.com>; user <user@spark.apache.org>
Cc: rxin <r...@databricks.com>; joshrosen <joshro...@databricks.com>; davies <dav...@databricks.com>
Subject: Re: About memory leak in spark 1.4.1
Re: Does Spark Streaming need to list all the files in a directory?
I guess it goes through those 500k files (https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L193) the first time, and then uses a filter from then on.

Thanks
Best Regards

On Fri, Jul 31, 2015 at 4:39 AM, Tathagata Das <t...@databricks.com> wrote:

For the first time it needs to list them. After that the list should be cached by the file stream implementation (as far as I remember).

On Thu, Jul 30, 2015 at 3:55 PM, Brandon White <bwwintheho...@gmail.com> wrote:

Is this a known bottleneck for Spark Streaming's textFileStream? Does it need to list all the current files in a directory before it gets the new files? Say I have 500k files in a directory; does it list them all in order to get the new files?
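The list-then-filter behavior described above can be sketched outside Spark. This is not FileInputDStream's actual implementation (which tracks batch times and a window of recently seen files); it is a minimal Python illustration of "list the directory, then keep only files modified after the last scan":

```python
# Minimal sketch of directory monitoring by modification time: the full
# listing happens every scan, but only files newer than the last scan
# are treated as "new" work. Illustrative only, not Spark's code.
import os
import tempfile
import time

def new_files(directory: str, last_scan_time: float):
    """Return paths of regular files modified after last_scan_time."""
    found = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) > last_scan_time:
            found.append(path)
    return found

d = tempfile.mkdtemp()
open(os.path.join(d, "old.txt"), "w").close()
cutoff = time.time() + 1  # pretend the last scan just happened
# old.txt was modified before the cutoff, so it is filtered out:
print(len(new_files(d, cutoff)))   # 0
```

The cost Brandon asks about is visible here: os.listdir still touches every entry on every scan, so a 500k-file directory pays the listing cost each batch even though the filter discards most of it.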
Re: unsubscribe
LOL Brandon! @ziqiu See http://spark.apache.org/community.html You need to send an email to user-unsubscr...@spark.apache.org

Thanks
Best Regards

On Fri, Jul 31, 2015 at 2:06 AM, Brandon White <bwwintheho...@gmail.com> wrote:

https://www.youtube.com/watch?v=JncgoPKklVE

On Thu, Jul 30, 2015 at 1:30 PM, <ziqiu...@accenture.com> wrote:
Re: About memory leak in spark 1.4.1
Spark uses a lot more than heap memory; it is the expected behavior. In 1.4, off-heap memory usage is supposed to grow in comparison to 1.3. Better to use as little memory as you can for heap, and since you are not utilizing it already, it is safe for you to reduce it. memoryFraction helps you optimize heap usage for your data/application profile while keeping it tight.

On Sun, Aug 2, 2015 at 12:54 PM Sea <261810...@qq.com> wrote:

spark.storage.memoryFraction is heap memory, but my situation is that the memory used is more than the heap! Anyone else using spark 1.4.1 in production?

------------------ Original message ------------------
From: Ted Yu <yuzhih...@gmail.com>
Sent: Sunday, Aug 2, 2015, 5:45 PM
To: Sea <261810...@qq.com>
Cc: Barak Gitsis <bar...@similarweb.com>; user@spark.apache.org; rxin <r...@databricks.com>; joshrosen <joshro...@databricks.com>; davies <dav...@databricks.com>
Subject: Re: About memory leak in spark 1.4.1

http://spark.apache.org/docs/latest/tuning.html does mention spark.storage.memoryFraction in two places. One is under the Cache Size Tuning section. FYI

-Barak
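As a back-of-envelope check on the numbers in this thread: under Spark 1.x's legacy static memory management, the storage region is roughly heap size times spark.storage.memoryFraction times a safety factor. The 0.6 default and 0.9 safety fraction below are assumptions based on the 1.x tuning docs; note that none of this accounts for the off-heap usage (JVM overhead, direct buffers) that is actually at issue in this thread:

```python
# Back-of-envelope sketch of the Spark 1.x *static* storage region.
# Assumed constants: spark.storage.memoryFraction=0.6 default, with a
# 0.9 safety fraction. This only describes heap; the 75.5g virtual /
# 55g resident figures in the thread include off-heap memory that
# memoryFraction does not govern.

def storage_region_gb(executor_heap_gb, memory_fraction=0.6, safety_fraction=0.9):
    """Approximate heap reserved for cached RDD blocks."""
    return executor_heap_gb * memory_fraction * safety_fraction

heap = 50  # spark.executor.memory from the thread
default_gb = storage_region_gb(heap)        # ~27 GB reserved for storage
tuned_gb = storage_region_gb(heap, 0.3)     # ~13.5 GB after setting 0.3
print(default_gb, tuned_gb)
```

This makes Barak's advice concrete: lowering the fraction frees roughly half the storage region for the rest of the application, but it cannot explain a process whose total footprint exceeds the heap limit, which is Sea's actual complaint.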
spark no output
hi community, I have run my k-means Spark application on 1 million data points. The program works, but no output is generated in HDFS. When it runs on 10,000 points, output is written. Maybe someone has an idea? best regards, paul
Re: spark no output
Can you provide some more detail: which release of Spark you're using; whether you were running in standalone or YARN cluster mode; and whether you have checked the driver log?

Cheers
Re: spark no output
I agree with Ted. Could you please post the log file?
Re: Encryption on RDDs or in-memory/cache on Apache Spark
I think your use case can already be implemented with HDFS encryption and/or SealedObject, if you are looking for something like Altibase. If you create a JIRA, you may want to set the bar a little higher and propose something like MIT CryptDB: https://css.csail.mit.edu/cryptdb/
Re: About memory leak in spark 1.4.1
spark.storage.memoryFraction is heap memory, but my situation is that the memory used is more than the heap! Anyone else using spark 1.4.1 in production?
Re: How to increase parallelism of a Spark cluster?
What kind of cluster? How many cores on each worker? Is there a config for the HTTP Solr client? I remember the standard HttpClient has a limit per route/host.

On Aug 2, 2015 8:17 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:

No one has any ideas? Is there some more information I should provide? I am looking for ways to increase the parallelism among workers. Currently I see a number of simultaneous connections to Solr equal to the number of workers. My number of partitions is larger (2.5x) than the number of workers, and the workers seem to be large enough to handle more than one task at a time. I am creating a single client per partition in my mapPartitions call. Not sure if that is creating the gating situation? Perhaps I should use a pool of clients instead? Would really appreciate some pointers. Thanks in advance for any help you can provide. -sujit

On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:

Hello, I am trying to run a Spark job that hits an external webservice to get back some information. The cluster is 1 master + 4 workers; each worker has 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server, accessed using code similar to that shown below.

def getResults(keyValues: Iterator[(String, Array[String])]): Iterator[(String, String)] = {
  val solr = new HttpSolrClient()
  initializeSolrParameters(solr)
  keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
}
myRDD.repartition(10)
  .mapPartitions(keyValues => getResults(keyValues))

The mapPartitions does some initialization of the SolrJ client per partition and then hits it for each record in the partition via the getResults() call. I repartitioned in the hope that this would result in 10 clients hitting Solr simultaneously (I would like to go up to maybe 30-40 simultaneous clients if I can). However, I counted the number of open connections using netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box and observed that Solr has a constant 4 clients (i.e., equal to the number of workers) over the lifetime of the run. My observation leads me to believe that each worker processes a single stream of work sequentially. However, from what I understand about how Spark works, each worker should be able to process a number of tasks in parallel, and repartition() is a hint for it to do so. Is there some SparkConf environment variable I should set to increase parallelism in these workers, or should I just configure a cluster with multiple workers per machine? Or is there something I am doing wrong? Thank you in advance for any pointers you can provide. -sujit
Re: TCP/IP speedup
This may seem like a silly question, but following Mark's link, the presentation talks about the TPC-DS benchmark. Here's my question: what benchmark results? If you go over to the TPC.org (http://tpc.org/) website, they have no TPC-DS benchmarks listed (either audited or unaudited). So what gives? Note: there are TPCx-HS benchmarks listed.

Thx
-Mike

On Aug 1, 2015, at 5:45 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

https://spark-summit.org/2015/events/making-sense-of-spark-performance/

On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus <edel...@gmail.com> wrote:

Hi All! How important would a significant performance improvement to TCP/IP itself be, in terms of overall job performance improvement? Which part would be most significantly accelerated? Would it be HDFS?

-- ttfn
Simon Edelhaus
California 2015
how to ignore MatchError then processing a large json file in spark-sql
I'm trying to process a bunch of large JSON log files with Spark, but it fails every time with `scala.MatchError`, whether I give it a schema or not. I just want to skip lines that do not match the schema, but I can't find how in the Spark docs. I know that writing a JSON parser and mapping it over the file RDD would get things done, but I want to use `sqlContext.read.schema(schema).json(fileNames).selectExpr(...)` because it's much easier to maintain. thanks
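The fallback the post mentions (writing a JSON parser and mapping it over the file RDD) can be sketched like this. The field names below are hypothetical; the point is parsing each line defensively and dropping records that don't match the expected shape, instead of letting one bad line fail the whole job:

```python
# Sketch of a defensive per-line JSON parser: malformed lines and lines
# with the wrong shape become None and are filtered out. The field names
# are made up for illustration.
import json

EXPECTED_FIELDS = {"timestamp", "level", "message"}   # assumed schema

def parse_or_none(line: str):
    try:
        record = json.loads(line)
    except ValueError:
        return None                      # malformed JSON
    if not isinstance(record, dict) or not EXPECTED_FIELDS <= record.keys():
        return None                      # does not match the schema
    return record

lines = [
    '{"timestamp": 1, "level": "INFO", "message": "ok"}',
    'not json at all',
    '{"level": "WARN"}',                 # missing fields
]
good = [r for r in (parse_or_none(l) for l in lines) if r is not None]
print(len(good))   # 1
```

In Spark the same shape would be a map over sc.textFile followed by a filter of the None/null results, which is exactly the maintenance burden the poster is hoping to avoid by using sqlContext.read directly.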
Re: How to increase parallelism of a Spark cluster?
No one has any ideas? Is there some more information I should provide? I am looking for ways to increase the parallelism among workers. Currently I see a number of simultaneous connections to Solr equal to the number of workers. My number of partitions is larger (2.5x) than the number of workers, and the workers seem to be large enough to handle more than one task at a time. I am creating a single client per partition in my mapPartitions call. Not sure if that is creating the gating situation? Perhaps I should use a pool of clients instead? Would really appreciate some pointers. Thanks in advance for any help you can provide. -sujit

On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:

Hello, I am trying to run a Spark job that hits an external webservice to get back some information. The cluster is 1 master + 4 workers; each worker has 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server, accessed using code similar to that shown below.

def getResults(keyValues: Iterator[(String, Array[String])]): Iterator[(String, String)] = {
  val solr = new HttpSolrClient()
  initializeSolrParameters(solr)
  keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
}
myRDD.repartition(10)
  .mapPartitions(keyValues => getResults(keyValues))

The mapPartitions does some initialization of the SolrJ client per partition and then hits it for each record in the partition via the getResults() call. I repartitioned in the hope that this would result in 10 clients hitting Solr simultaneously (I would like to go up to maybe 30-40 simultaneous clients if I can). However, I counted the number of open connections using netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box and observed that Solr has a constant 4 clients (i.e., equal to the number of workers) over the lifetime of the run. My observation leads me to believe that each worker processes a single stream of work sequentially. However, from what I understand about how Spark works, each worker should be able to process a number of tasks in parallel, and repartition() is a hint for it to do so. Is there some SparkConf environment variable I should set to increase parallelism in these workers, or should I just configure a cluster with multiple workers per machine? Or is there something I am doing wrong? Thank you in advance for any pointers you can provide. -sujit
Re: How to increase parallelism of a Spark cluster?
So how many cores do you configure per node? Do you have something like --total-executor-cores or --num-executors configured? (I'm not sure what kind of cluster the Databricks platform provides; if it's standalone, then the first option should be used.) If you have 4 cores in total, then even though you have 4 cores per machine, only 1 is working on each machine... which could be a cause. Another option: you are hitting some default config limiting the number of concurrent routes or the max total connections from the JVM; look at https://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html (assuming you are using HttpClient 4.x and not 3.x). Not sure what the defaults are...

On 2 August 2015 at 23:42, Sujit Pal <sujitatgt...@gmail.com> wrote:

Hi Igor, The cluster is a Databricks Spark cluster. It consists of 1 master + 4 workers; each worker has 60GB RAM and 4 CPUs. The original mail has some more details (also, the reference to HttpSolrClient there should be HttpSolrServer; sorry about that, a mistake while writing the email). There is no additional configuration of the external Solr host from my code; I am using the default HttpClient provided by HttpSolrServer. According to the Javadocs, you can pass in an HttpClient object as well. Is there some specific configuration you would suggest to get past any limits? On another project I faced a similar problem, but I had more leeway (I was using a Spark cluster on EC2) and less time; my workaround was to use Python multiprocessing to create a program that started 30 Python JSON/HTTP clients and wrote output into 30 output files, which were then processed by Spark. I mention this because I was using default configurations there as well and just needed to increase the number of connections against Solr. This time round, I would like to do this through Spark because it makes the pipeline less complex.
-sujit On Sun, Aug 2, 2015 at 10:52 AM, Igor Berman igor.ber...@gmail.com wrote: What kind of cluster? How many cores on each worker? Is there config for the http solr client? I remember the standard httpclient has a limit per route/host. On Aug 2, 2015 8:17 PM, Sujit Pal sujitatgt...@gmail.com wrote: No one has any ideas? Is there some more information I should provide? I am looking for ways to increase the parallelism among workers. Currently I just see the number of simultaneous connections to Solr equal to the number of workers. My number of partitions is (2.5x) larger than the number of workers, and the workers seem to be large enough to handle more than one task at a time. I am creating a single client per partition in my mapPartitions call. Not sure if that is creating the gating situation? Perhaps I should use a pool of clients instead? Would really appreciate some pointers. Thanks in advance for any help you can provide. -sujit On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal sujitatgt...@gmail.com wrote: Hello, I am trying to run a Spark job that hits an external webservice to get back some information. The cluster is 1 master + 4 workers; each worker has 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server, accessed using code similar to that shown below.

def getResults(keyValues: Iterator[(String, Array[String])]): Iterator[(String, String)] = {
  val solr = new HttpSolrClient()
  initializeSolrParameters(solr)
  keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
}

myRDD.repartition(10)
  .mapPartitions(keyValues => getResults(keyValues))

The mapPartitions does some initialization of the SolrJ client per partition and then hits it for each record in the partition via the getResults() call. I repartitioned in the hope that this would result in 10 clients hitting Solr simultaneously (I would like to go up to maybe 30-40 simultaneous clients if I can). However, I counted the number of open connections using netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box and observed that Solr has a constant 4 clients (i.e., equal to the number of workers) over the lifetime of the run. My observation leads me to believe that each worker processes a single stream of work sequentially. However, from what I understand about how Spark works, each worker should be able to process a number of tasks in parallel, and repartition() is a hint for it to do so. Is there some SparkConf setting I should use to increase parallelism in these workers, or should I just configure a cluster with multiple workers per machine? Or is there something I am doing wrong? Thank you in advance for any pointers you can provide. -sujit
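Each Spark task drives the snippet above with one synchronous client, so a worker running one task holds one connection open. Purely as an illustrative sketch (plain Python, no Spark or Solr here; `fetch` is a hypothetical stand-in for the real Solr lookup), a small thread pool inside each partition's handler is one way to multiply simultaneous connections per task:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(key):
    # Stand-in for the real HTTP call to Solr; returns (key, result).
    return key, "result-for-" + key

def get_results(keys, max_workers=8):
    """Process one partition's keys with a small thread pool, so a
    single task can keep several backend connections in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order while fanning out the calls.
        return list(pool.map(fetch, keys))

print(get_results(["a", "b", "c"]))
# [('a', 'result-for-a'), ('b', 'result-for-b'), ('c', 'result-for-c')]
```

With 4 workers each running `max_workers=8` threads, the cluster could in principle hold ~32 connections to Solr, which is in the 30-40 range asked about, without changing the partitioning.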
Re: How to increase parallelism of a Spark cluster?
Hi Igor, The cluster is a Databricks Spark cluster. It consists of 1 master + 4 workers; each worker has 60GB RAM and 4 CPUs. The original mail has some more details (also, the reference to HttpSolrClient in there should be HttpSolrServer; sorry about that, a mistake while writing the email). There is no additional configuration on the external Solr host from my code; I am using the default HttpClient provided by HttpSolrServer. According to the Javadocs, you can pass in an HttpClient object as well. Is there some specific configuration you would suggest to get past any limits? On another project, I faced a similar problem but had more leeway (was using a Spark cluster on EC2) and less time; my workaround was to use Python multiprocessing to create a program that started up 30 Python JSON/HTTP clients and wrote output into 30 output files, which were then processed by Spark. The reason I mention this is that I was using default configurations there as well, and just needed to increase the number of connections against Solr to a higher number. This time round, I would like to do this through Spark because it makes the pipeline less complex. -sujit On Sun, Aug 2, 2015 at 10:52 AM, Igor Berman igor.ber...@gmail.com wrote: What kind of cluster? How many cores on each worker? Is there config for the http solr client? I remember the standard httpclient has a limit per route/host. On Aug 2, 2015 8:17 PM, Sujit Pal sujitatgt...@gmail.com wrote: No one has any ideas? Is there some more information I should provide? I am looking for ways to increase the parallelism among workers. Currently I just see the number of simultaneous connections to Solr equal to the number of workers. My number of partitions is (2.5x) larger than the number of workers, and the workers seem to be large enough to handle more than one task at a time. I am creating a single client per partition in my mapPartitions call. Not sure if that is creating the gating situation? Perhaps I should use a pool of clients instead? Would really appreciate some pointers. Thanks in advance for any help you can provide. -sujit
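On the "pool of clients" idea raised in this thread: a minimal borrow/return pool can be sketched in plain Python. Everything here is hypothetical illustration — `make_client` stands in for however the real job constructs an HttpSolrServer — not any Spark or SolrJ API:

```python
import queue

class ClientPool:
    """Minimal borrow/return pool. Clients are created up front by the
    make_client factory and handed out one at a time; borrow() blocks
    when all clients are checked out, which also caps concurrency."""
    def __init__(self, make_client, size):
        self._q = queue.Queue()
        for _ in range(size):
            self._q.put(make_client())

    def borrow(self):
        return self._q.get()

    def give_back(self, client):
        self._q.put(client)

pool = ClientPool(lambda: object(), size=4)
client = pool.borrow()
# ... use the client for one record or batch ...
pool.give_back(client)
```

In a mapPartitions handler, each record (or small batch) would borrow a client, use it, and return it, so the expensive client construction is amortized across the partition while several requests can still be in flight.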
RE: How to increase parallelism of a Spark cluster?
Can you share the transformations up to the foreachPartition? From: Sujit Pal sujitatgt...@gmail.com Sent: 8/2/2015 4:42 PM To: Igor Berman igor.ber...@gmail.com Cc: user user@spark.apache.org Subject: Re: How to increase parallelism of a Spark cluster?
Extremely poor predictive performance with RF in mllib
Hi, This might be a long shot, but has anybody run into very poor predictive performance using RandomForest with MLlib? Here is what I'm doing:
- Spark 1.4.1 with PySpark
- Python 3.4.2
- ~30,000 tweets of text
- 12289 1s and 15956 0s
- Whitespace tokenization and then the hashing trick for feature selection, using 10,000 features
- Run RF with 100 trees and maxDepth of 4, and then predict using the features from all the 1s observations.
So in theory, I should get predictions of close to 12289 1s (especially if the model overfits). But I'm getting exactly 0 1s, which sounds ludicrous to me and makes me suspect something is wrong with my code or that I'm missing something. I notice similar behavior (although not as extreme) if I play around with the settings. But I'm getting normal behavior with other classifiers, so I don't think it's my setup that's the problem. For example:

lrm = LogisticRegressionWithSGD.train(lp, iterations=10)
logit_predict = lrm.predict(predict_feat)
logit_predict.sum()
# 9077

nb = NaiveBayes.train(lp)
nb_predict = nb.predict(predict_feat)
nb_predict.sum()
# 10287.0

rf = RandomForest.trainClassifier(lp, numClasses=2, categoricalFeaturesInfo={}, numTrees=100, seed=422)
rf_predict = rf.predict(predict_feat)
rf_predict.sum()
# 0.0

This code was all run back to back, so I didn't change anything in between. Does anybody have a possible explanation for this? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Extremely-poor-predictive-performance-with-RF-in-mllib-tp24112.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
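For reference, the "hashing trick" feature setup described above can be sketched in a few lines of plain Python. This is a hypothetical helper, not the poster's code; note also that Python 3 randomizes str hashes per process, so a real pipeline should use a stable hash (e.g. MLlib's HashingTF or hashlib) rather than the built-in hash():

```python
def hash_features(tokens, num_features=10000):
    """Hashing trick: map each token to a bucket by hash and count.
    Colliding tokens share a bucket, which is the trick's trade-off
    against keeping an explicit vocabulary."""
    vec = [0.0] * num_features
    for tok in tokens:
        vec[hash(tok) % num_features] += 1.0
    return vec

feats = hash_features("spark makes big data simple".split())
print(int(sum(feats)))  # 5 tokens -> total count is 5
```

The total count always equals the number of tokens regardless of hashing, so a quick sanity check on the feature vectors is cheap even before any training.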
Re: How to increase parallelism of a Spark cluster?
I don't know if your assertion/expectation that workers will process multiple partitions in parallel is really valid, or if having more partitions than workers will necessarily help (unless you are memory bound, in which case partitioning is essentially reducing your work size rather than increasing execution parallelism). [Disclaimer: I am no authority on Spark, but wanted to throw in my spin based on my own understanding.] Nothing official about it :) -abhishek-
Re: About memory leak in spark 1.4.1
"spark uses a lot more than heap memory, it is the expected behavior." It didn't exist in spark 1.3.x. What does "a lot more than" mean? It means that I lose control of it! I try to allocate 31g, but it still grows to 55g and continues to grow!!! That is the point! I have tried setting memoryFraction to 0.2, but it didn't help. I don't know whether it will still exist in the next release, 1.5; I wish not.

From: Barak Gitsis bar...@similarweb.com Sent: Sunday, August 2, 2015 9:55 To: Sea 261810...@qq.com; Ted Yu yuzhih...@gmail.com Cc: user@spark.apache.org; rxin r...@databricks.com; joshrosen joshro...@databricks.com; davies dav...@databricks.com Subject: Re: About memory leak in spark 1.4.1

spark uses a lot more than heap memory, it is the expected behavior. In 1.4, off-heap memory usage is supposed to grow in comparison to 1.3. Better to use as little memory as you can for heap, and since you are not utilizing it already, it is safe for you to reduce it. memoryFraction helps you optimize heap usage for your data/application profile while keeping it tight.

On Sun, Aug 2, 2015 at 12:54 PM Sea 261810...@qq.com wrote: spark.storage.memoryFraction is in heap memory, but my situation is that the memory used is more than heap memory! Anyone else using spark 1.4.1 in production?

From: Ted Yu yuzhih...@gmail.com Sent: Sunday, August 2, 2015 5:45 To: Sea 261810...@qq.com Cc: Barak Gitsis bar...@similarweb.com; user@spark.apache.org; rxin r...@databricks.com; joshrosen joshro...@databricks.com; davies dav...@databricks.com Subject: Re: About memory leak in spark 1.4.1

http://spark.apache.org/docs/latest/tuning.html does mention spark.storage.memoryFraction in two places. One is under the Cache Size Tuning section. FYI

On Sun, Aug 2, 2015 at 2:16 AM, Sea 261810...@qq.com wrote: Hi, Barak. It is ok with spark 1.3.0; the problem is with spark 1.4.1. I don't think spark.storage.memoryFraction will make any sense, because it is still in heap memory.
From: Barak Gitsis bar...@similarweb.com Sent: Sunday, August 2, 2015 4:11 To: Sea 261810...@qq.com; user user@spark.apache.org Cc: rxin r...@databricks.com; joshrosen joshro...@databricks.com; davies dav...@databricks.com Subject: Re: About memory leak in spark 1.4.1

Hi, reducing spark.storage.memoryFraction did the trick for me. Heap doesn't get filled because it is reserved. My reasoning is: I give the executor all the memory I can give it, so that makes it a boundary. From there I try to make the best use of memory I can. storage.memoryFraction is, in a sense, user data space; the rest can be used by the system. If you don't have so much data that you MUST store in memory for performance, better to give spark more space... I ended up setting it to 0.3. All that said, it is on spark 1.3 on a cluster. Hope that helps.

On Sat, Aug 1, 2015 at 5:43 PM Sea 261810...@qq.com wrote: Hi, all. I upgraded spark to 1.4.1, and many applications failed... I find the heap memory is not full, but the CoarseGrainedExecutorBackend process takes more memory than I expect, and it increases as time goes on; finally it exceeds the max limit of the server and the worker dies. Can anyone help? Mode: standalone. spark.executor.memory 50g

25583 xiaoju 20 0 75.5g 55g 28m S 1729.3 88.1 2172:52 java

55g is more than the 50g I applied for. -Barak
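For anyone following along, the two settings discussed in this thread go in spark-defaults.conf (or are passed as --conf flags at submit time). The values below are simply the ones quoted by the posters, not recommendations:

```
# spark-defaults.conf -- values quoted in this thread
spark.executor.memory          50g
spark.storage.memoryFraction   0.2
```

Note that spark.storage.memoryFraction only carves up the JVM heap; the off-heap growth described above is outside its reach, which is the crux of Sea's complaint.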
Re: spark cluster setup
What do the master logs show? Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co Check out Reifier at Spark Summit 2015 https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/ On Mon, Aug 3, 2015 at 7:46 AM, Angel Angel areyouange...@gmail.com wrote: Hello Sir, I have installed Spark. The local spark-shell is working fine, but whenever I try the master configuration I get some errors. When I run this command:

MASTER=spark://hadoopm0:7077 spark-shell

I get errors like:

15/07/27 21:17:26 INFO AppClient$ClientActor: Connecting to master spark://hadoopm0:7077...
15/07/27 21:17:46 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/07/27 21:17:46 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.
15/07/27 21:17:46 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.

Also, I have attached a screenshot of the Master UI. I have also tested using the telnet command: it shows that hadoopm0 is connected. Can you please give me some references or documentation on how to solve this issue? Thanks in advance. Thanking You,
Re: TCP/IP speedup
On 1 Aug 2015, at 18:26, Ruslan Dautkhanov dautkha...@gmail.com wrote: If your network is bandwidth-bound, you'll see that setting jumbo frames (MTU 9000) may increase bandwidth up to ~20%. http://docs.hortonworks.com/HDP2Alpha/index.htm#Hardware_Recommendations_for_Hadoop.htm "Enabling Jumbo Frames across the cluster improves bandwidth"

+1. You can also get better checksums of packets, so the (very small but non-zero) risk of corrupted network packets drops a bit more.

If the Spark workload is not network bandwidth-bound, I can see it being a few percent to no improvement. Put differently: it shouldn't hurt. The shuffle phase is the most network-heavy; especially as it can span the entire cluster, backbone/bisection bandwidth can become the bottleneck and jobs can interfere with each other. Scheduling work close to the HDFS data means that HDFS reads should often be local (the TCP stack gets bypassed entirely), or at least rack-local (sharing the switch, not any backbone). But there are other things there, as the slides talk about:
- stragglers: often a sign of pending HDD failure, as reads are retried. The classic Hadoop MR engine detects these, can spin up alternate mappers (if you enable speculation), and will blacklist the node for further work. Sometimes, though, that straggling is just unbalanced data; some bits of work may be computationally a lot harder, slowing things down.
- contention for work on the nodes. In YARN you request how many virtual cores you want (ops get to define the map of virtual to physical), with each node having a finite set of cores, but...
- unless CPU throttling is turned on, competing processes can take up more CPU than they asked for.
- that virtual:physical core setting may be of
There's also disk IOP contention: two jobs trying to get at the same spindle, even though there are lots of disks on the server. There's not much you can do about that (today). A key takeaway from that talk, which applies to all work-tuning talks, is: get data from your real workloads. There's some good htrace instrumentation in HDFS these days; I haven't looked at Spark's instrumentation to see how they hook up. You can also expect to have some network monitoring (sflow, ...) which you could use to see if the backbone is overloaded. Don't forget the Linux tooling either: iotop and co. There's lots of room to play here; once you've got the data you can see where to focus, then decide how much time to spend trying to tune it. -steve

-- Ruslan Dautkhanov On Sat, Aug 1, 2015 at 6:08 PM, Simon Edelhaus edel...@gmail.com wrote: H 2% huh. -- ttfn Simon Edelhaus California 2015 On Sat, Aug 1, 2015 at 3:45 PM, Mark Hamstra m...@clearstorydata.com wrote: https://spark-summit.org/2015/events/making-sense-of-spark-performance/ On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus edel...@gmail.com wrote: Hi All! How important would a significant performance improvement to TCP/IP itself be, in terms of overall job performance improvement? Which part would be most significantly accelerated? Would it be HDFS? -- ttfn Simon Edelhaus California 2015
Re: How to increase parallelism of a Spark cluster?
On 2 Aug 2015, at 13:42, Sujit Pal sujitatgt...@gmail.com wrote: There is no additional configuration on the external Solr host from my code; I am using the default HttpClient provided by HttpSolrServer. According to the Javadocs, you can pass in an HttpClient object as well. Is there some specific configuration you would suggest to get past any limits?

Usually there's some thread pooling going on client side, covered in docs like http://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html I don't know if that applies, or how to tune it, etc. I do know that if you go the other way and allow unlimited connections, you raise different support problems. -steve
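To make the "limit per route/host" idea from this thread concrete without depending on HttpClient itself, here is a toy Python model (PerHostLimiter is hypothetical, not any real library API): if the client caps connections at N per host, that cap, not Spark's task parallelism, bounds how many requests reach Solr at once.

```python
import threading

class PerHostLimiter:
    """Toy model of an HTTP client's per-route connection cap: at most
    'limit' callers may hold a slot for the same host at one time."""
    def __init__(self, limit):
        self._limit = limit
        self._sems = {}
        self._lock = threading.Lock()

    def _sem(self, host):
        with self._lock:
            return self._sems.setdefault(host, threading.Semaphore(self._limit))

    def acquire(self, host):
        # Non-blocking: returns False where a real client would queue/block.
        return self._sem(host).acquire(blocking=False)

    def release(self, host):
        self._sem(host).release()

lim = PerHostLimiter(limit=2)
print(lim.acquire("solr"), lim.acquire("solr"), lim.acquire("solr"))
# True True False -- the third request is throttled by the per-host cap
```

This is why raising the connection-manager limits (per the connmgmt docs linked above) can matter even when plenty of Spark tasks are ready to issue requests.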