Cannot Import Package (spark-csv)

2015-08-02 Thread billchambers
I am trying to import the spark csv package while using the scala spark
shell. Spark 1.4.1, Scala 2.11

I am starting the shell with:

bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0 --jars
../sjars/spark-csv_2.11-1.1.0.jar --master local


I then try and run



and get the following error:



What am i doing wrong?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-Import-Package-spark-csv-tp24109.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Cannot Import Package (spark-csv)

2015-08-02 Thread Ted Yu
The command you ran and the error you got were not visible.

Mind sending them again ?

Cheers

On Sun, Aug 2, 2015 at 8:33 PM, billchambers wchamb...@ischool.berkeley.edu
 wrote:

 I am trying to import the spark csv package while using the scala spark
 shell. Spark 1.4.1, Scala 2.11

 I am starting the shell with:

 bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0 --jars
 ../sjars/spark-csv_2.11-1.1.0.jar --master local


 I then try and run



 and get the following error:



 What am i doing wrong?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-Import-Package-spark-csv-tp24109.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Cannot Import Package (spark-csv)

2015-08-02 Thread billchambers
Commands again are:

Sure the commands are:

scala> val df =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").load("cars.csv")

and get the following error: 

java.lang.RuntimeException: Failed to load class for data source:
com.databricks.spark.csv
  at scala.sys.package$.error(package.scala:27)
  at
org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:220)
  at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:233)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
  ... 49 elided
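
A minimal sketch of the same sequence with the stripped string literals restored
(the package version, Scala suffix, and file path below are illustrative, not
necessarily the poster's exact setup; the artifact's _2.xx suffix has to match the
Scala version of the Spark build):

    bin/spark-shell --packages com.databricks:spark-csv_2.10:1.1.0 --master local

    scala> val df = sqlContext.read
         |   .format("com.databricks.spark.csv")   // data source name as a quoted string
         |   .option("header", "true")             // options are string key/value pairs
         |   .load("cars.csv")
    scala> df.printSchema()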



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-Import-Package-spark-csv-tp24109p24110.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Cannot Import Package (spark-csv)

2015-08-02 Thread Ted Yu
I tried the following command on master branch:
bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 --jars
../spark-csv_2.10-1.0.3.jar --master local

I didn't reproduce the error with your command.

FYI

On Sun, Aug 2, 2015 at 8:57 PM, Bill Chambers 
wchamb...@ischool.berkeley.edu wrote:

 Sure the commands are:

 scala> val df =
 sqlContext.read.format("com.databricks.spark.csv").option("header",
 "true").load("cars.csv")

 and get the following error:

 java.lang.RuntimeException: Failed to load class for data source:
 com.databricks.spark.csv
   at scala.sys.package$.error(package.scala:27)
   at
 org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:220)
   at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:233)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
   ... 49 elided

 On Sun, Aug 2, 2015 at 8:56 PM, Ted Yu yuzhih...@gmail.com wrote:

 The command you ran and the error you got were not visible.

 Mind sending them again ?

 Cheers

 On Sun, Aug 2, 2015 at 8:33 PM, billchambers 
 wchamb...@ischool.berkeley.edu wrote:

 I am trying to import the spark csv package while using the scala spark
 shell. Spark 1.4.1, Scala 2.11

 I am starting the shell with:

 bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0 --jars
 ../sjars/spark-csv_2.11-1.1.0.jar --master local


 I then try and run



 and get the following error:



 What am i doing wrong?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-Import-Package-spark-csv-tp24109.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





 --
 Bill Chambers
 http://billchambers.me/
 Email wchamb...@ischool.berkeley.edu | LinkedIn
 http://linkedin.com/in/wachambers | Twitter
 https://twitter.com/b_a_chambers | Github https://github.com/anabranch



Checkpoint file not found

2015-08-02 Thread Anand Nalya
Hi,

I'm writing a Streaming application in Spark 1.3. After running for some
time, I'm getting following execption. I'm sure, that no other process is
modifying the hdfs file. Any idea, what might be the cause of this?

15/08/02 21:24:13 ERROR scheduler.DAGSchedulerEventProcessLoop:
DAGSchedulerEventProcessLoop failed; shutting down SparkContext
java.io.FileNotFoundException: File does not exist:
hdfs://node16:8020/user/anandnalya/tiered-original/e6794c2c-1c9f-414a-ae7e-e58a8f874661/rdd-5112/part-0
at
org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1132)
at
org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1124)
at
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1124)
at
org.apache.spark.rdd.CheckpointRDD.getPreferredLocations(CheckpointRDD.scala:66)
at
org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:230)
at
org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:230)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:230)
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1324)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1334)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1333)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1331)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1331)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1334)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1333)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1331)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1331)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1334)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1333)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1331)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1331)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1334)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at
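
For reference, a minimal sketch of how checkpointing is typically wired up in a
Spark 1.3 streaming job (illustrative only, not Anand's application; the checkpoint
directory and batch interval are made up). The rdd-NNNN files under the checkpoint
directory are written and cleaned up by Spark itself, so anything that deletes or
recreates that directory while the job is running can lead to a FileNotFoundException
like the one above.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("tiered-original")
      val ssc = new StreamingContext(conf, Seconds(10))
      // metadata and RDD checkpoints go under this directory (hypothetical path)
      ssc.checkpoint("hdfs://node16:8020/user/anandnalya/checkpoints")
      // ... build DStreams here; call checkpoint(interval) on stateful streams ...
      ssc
    }

    val ssc = StreamingContext.getOrCreate(
      "hdfs://node16:8020/user/anandnalya/checkpoints", createContext _)
    ssc.start()
    ssc.awaitTermination()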

Re: Spark Number of Partitions Recommendations

2015-08-02 Thread Понькин Алексей
Yes, I forgot to mention:
I chose a prime number as the modulo for the hash function because my keys are usually
strings and Spark computes a record's partition from the key hash (see
HashPartitioner.scala). So, to avoid a large number of collisions (many keys landing
in just a few partitions), it is common to use a prime number as the modulus. But it
only makes sense for String keys, of course, because of the hash function. If you
have a different hash function for keys of a different type, you can use any other
modulus instead of a prime number.
I like this discussion on this topic 
http://stackoverflow.com/questions/1145217/why-should-hash-functions-use-a-prime-number-modulus
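
A small sketch of the scheme described above (the helper function here is written for
illustration; it is not from the thread). Spark's HashPartitioner assigns a pair to
partition nonNegativeMod(key.hashCode, numPartitions), which is why a prime partition
count helps spread string keys more evenly:

    def nextPrimeNumberAbove(x: Int): Int = {
      def isPrime(n: Int): Boolean =
        n > 1 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)
      Iterator.from(x + 1).find(isPrime).get
    }

    val numExecutors  = 10   // illustrative: what you would pass as --num-executors
    val executorCores = 4    // illustrative: what you would pass as --executor-cores
    val k             = 2    // the multiplier K from ponkin's formula quoted below, tuned empirically
    val numPartitions = nextPrimeNumberAbove(k * numExecutors * executorCores)

    // pairRdd is assumed to be an existing RDD[(String, V)]
    val partitioned = pairRdd.partitionBy(new org.apache.spark.HashPartitioner(numPartitions))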


-- 
Yandex.Mail: reliable mail
http://mail.yandex.ru/neo2/collect/?exp=1t=1


02.08.2015, 00:14, Ruslan Dautkhanov dautkha...@gmail.com:
 You should also take into account amount of memory that you plan to use.
 It's advised not to give too much memory for each executor .. otherwise GC 
 overhead will go up.

 Btw, why prime numbers?

 --
 Ruslan Dautkhanov

 On Wed, Jul 29, 2015 at 3:31 AM, ponkin alexey.pon...@ya.ru wrote:
 Hi Rahul,

 Where did you see such a recommendation?
 I personally define partitions with the following formula

 partitions = nextPrimeNumberAbove( K*(--num-executors * --executor-cores ) )

 where
 nextPrimeNumberAbove(x) - prime number which is greater than x
 K - a multiplier; to calibrate it, start with 1 and increase until join
 performance starts to degrade

 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Number-of-Partitions-Recommendations-tp24022p24059.html

 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: About memory leak in spark 1.4.1

2015-08-02 Thread Barak Gitsis
Hi,
reducing spark.storage.memoryFraction did the trick for me. Heap doesn't
get filled because it is reserved..
My reasoning is:
I give executor all the memory i can give it, so that makes it a boundary.
From here i try to make the best use of memory I can.
storage.memoryFraction is in a sense user data space.  The rest can be used
by the system.
If you don't have so much data that you MUST store in memory for
performance, better give spark more space..
ended up setting it to 0.3

All that said, it is on spark 1.3 on cluster

hope that helps
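
A minimal sketch of the setting described above (the app name and values are
illustrative, not a recommendation; spark.storage.memoryFraction belongs to the
static memory manager used by Spark 1.3/1.4):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("my-app")                        // hypothetical application name
      .set("spark.executor.memory", "50g")         // total heap per executor
      .set("spark.storage.memoryFraction", "0.3")  // share of heap reserved for cached data

    // or equivalently on the command line:
    //   spark-submit --conf spark.storage.memoryFraction=0.3 ...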

On Sat, Aug 1, 2015 at 5:43 PM Sea 261810...@qq.com wrote:

 Hi, all
 I upgraded spark to 1.4.1, and many applications failed... I find the heap
 memory is not full, but the process of CoarseGrainedExecutorBackend will
 take more memory than I expect, and it will increase as time goes on,
 finally exceeding the max limit of the server, and the worker will die.

 Can anyone help?

 Mode: standalone

 spark.executor.memory 50g

 25583 xiaoju20   0 75.5g  55g  28m S 1729.3 88.1   2172:52 java

 55g more than 50g I apply

 --
*-Barak*


Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Akhil Das
Currently RDDs are not encrypted, I think you can go ahead and open a JIRA
to add this feature and may be in future release it could be added.

Thanks
Best Regards

On Fri, Jul 31, 2015 at 1:47 PM, Matthew O'Reilly moreill...@qub.ac.uk
wrote:

 Hi,

 I am currently working on the latest version of Apache Spark (1.4.1),
 pre-built package for Hadoop 2.6+.

 Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory/cache
 (something similar is Altibase's HDB:
 http://altibase.com/in-memory-database-computing-solutions/security/)
 when running applications in Spark? Or is there an external
 library/framework which could be used to encrypt RDDs or in-memory/cache in
 Spark?

 I discovered it is possible to encrypt the data, and encapsulate it into
 RDD. However, I feel this affects Spark's fast data processing as it is
 slower to encrypt the data, and then encapsulate it to RDD; it's then a two
 step process. Encryption and storing data should be done parallel.

 Any help would be appreciated.

 Many thanks,
 Matthew


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
Hi, Barak
It is ok with spark 1.3.0; the problem is with spark 1.4.1.
I don't think spark.storage.memoryFraction will make any difference, because it
is still in heap memory.




------ Original Message ------
From: Barak Gitsis;bar...@similarweb.com;
Sent: Aug 2, 2015 (Sunday) 4:11 PM
To: Sea261810...@qq.com; useruser@spark.apache.org;
Cc: rxinr...@databricks.com; joshrosenjoshro...@databricks.com;
daviesdav...@databricks.com;
Subject: Re: About memory leak in spark 1.4.1



Hi,reducing spark.storage.memoryFraction did the trick for me. Heap doesn't get 
filled because it is reserved..
My reasoning is: 
I give executor all the memory i can give it, so that makes it a boundary.
From here i try to make the best use of memory I can. storage.memoryFraction is 
in a sense user data space.  The rest can be used by the system. 
If you don't have so much data that you MUST store in memory for performance, 
better give spark more space.. 
ended up setting it to 0.3


All that said, it is on spark 1.3 on cluster


hope that helps


On Sat, Aug 1, 2015 at 5:43 PM Sea 261810...@qq.com wrote:

Hi, all
I upgraded spark to 1.4.1, and many applications failed... I find the heap memory is
not full, but the process of CoarseGrainedExecutorBackend will take more
memory than I expect, and it will increase as time goes on, finally exceeding
the max limit of the server, and the worker will die.


Can anyone help?


Mode: standalone


spark.executor.memory 50g


25583 xiaoju20   0 75.5g  55g  28m S 1729.3 88.1   2172:52 java


55g more than 50g I apply



-- 

-Barak

Re: Does Spark Streaming need to list all the files in a directory?

2015-08-02 Thread Akhil Das
I guess it goes through those 500k files
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L193 for
the first time and then uses a filter from the next time on.

Thanks
Best Regards
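
A minimal usage sketch of the file stream being discussed (the directory, batch
interval and counting action are illustrative). Each batch interval, the underlying
FileInputDStream lists the monitored directory and picks up only files it has not
already processed:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf  = new SparkConf().setAppName("file-stream-example")
    val ssc   = new StreamingContext(conf, Seconds(30))
    val lines = ssc.textFileStream("hdfs:///incoming/logs")   // hypothetical directory
    lines.count().print()
    ssc.start()
    ssc.awaitTermination()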

On Fri, Jul 31, 2015 at 4:39 AM, Tathagata Das t...@databricks.com wrote:

 For the first time it needs to list them. AFter that the list should be
 cached by the file stream implementation (as far as I remember).


 On Thu, Jul 30, 2015 at 3:55 PM, Brandon White bwwintheho...@gmail.com
 wrote:

 Is this a known bottle neck for Spark Streaming textFileStream? Does it
 need to list all the current files in a directory before he gets the new
 files? Say I have 500k files in a directory, does it list them all in order
 to get the new files?





Re: unsubscribe

2015-08-02 Thread Akhil Das
​LOL Brandon!

@ziqiu See http://spark.apache.org/community.html

You need to send an email to user-unsubscr...@spark.apache.org​

Thanks
Best Regards

On Fri, Jul 31, 2015 at 2:06 AM, Brandon White bwwintheho...@gmail.com
wrote:

 https://www.youtube.com/watch?v=JncgoPKklVE

 On Thu, Jul 30, 2015 at 1:30 PM, ziqiu...@accenture.com wrote:



 --

 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise confidential information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the e-mail by you is prohibited. Where allowed
 by local law, electronic communications with Accenture and its affiliates,
 including e-mail and instant messaging (including content), may be scanned
 by our systems for the purposes of information security and assessment of
 internal compliance with Accenture policy.

 __

 www.accenture.com





Re: About memory leak in spark 1.4.1

2015-08-02 Thread Barak Gitsis
spark uses a lot more than heap memory, it is the expected behavior.
in 1.4 off-heap memory usage is supposed to grow in comparison to 1.3

Better use as little memory as you can for heap, and since you are not
utilizing it already, it is safe for you to reduce it.
memoryFraction helps you optimize heap usage for your data/application
profile while keeping it tight.






On Sun, Aug 2, 2015 at 12:54 PM Sea 261810...@qq.com wrote:

 spark.storage.memoryFraction is in heap memory, but my situation is that
 the memory is more than heap memory !

 Anyone else use spark 1.4.1 in production?


 ------ Original Message ------
 *From:* Ted Yu;yuzhih...@gmail.com;
 *Sent:* Aug 2, 2015 (Sunday) 5:45 PM
 *To:* Sea261810...@qq.com;
 *Cc:* Barak Gitsisbar...@similarweb.com; user@spark.apache.org
 user@spark.apache.org; rxinr...@databricks.com; joshrosen
 joshro...@databricks.com; daviesdav...@databricks.com;
 *Subject:* Re: About memory leak in spark 1.4.1

 http://spark.apache.org/docs/latest/tuning.html does mention 
 spark.storage.memoryFraction
 in two places.
 One is under Cache Size Tuning section.

 FYI

 On Sun, Aug 2, 2015 at 2:16 AM, Sea 261810...@qq.com wrote:

 Hi, Barak
 It is ok with spark 1.3.0, the problem is with spark 1.4.1.
 I don't think spark.storage.memoryFraction will make any sense,
 because it is still in heap memory.


 ------ Original Message ------
 *From:* Barak Gitsis;bar...@similarweb.com;
 *Sent:* Aug 2, 2015 (Sunday) 4:11 PM
 *To:* Sea261810...@qq.com; useruser@spark.apache.org;
 *Cc:* rxinr...@databricks.com; joshrosenjoshro...@databricks.com;
 daviesdav...@databricks.com;
 *Subject:* Re: About memory leak in spark 1.4.1

 Hi,
 reducing spark.storage.memoryFraction did the trick for me. Heap doesn't
 get filled because it is reserved..
 My reasoning is:
 I give executor all the memory i can give it, so that makes it a boundary
 .
 From here i try to make the best use of memory I can.
 storage.memoryFraction is in a sense user data space.  The rest can be used
 by the system.
 If you don't have so much data that you MUST store in memory for
 performance, better give spark more space..
 ended up setting it to 0.3

 All that said, it is on spark 1.3 on cluster

 hope that helps

 On Sat, Aug 1, 2015 at 5:43 PM Sea 261810...@qq.com wrote:

 Hi, all
 I upgraded spark to 1.4.1, and many applications failed... I find the heap
 memory is not full, but the process of CoarseGrainedExecutorBackend will
 take more memory than I expect, and it will increase as time goes on,
 finally exceeding the max limit of the server, and the worker will die.

 Can anyone help?

 Mode: standalone

 spark.executor.memory 50g

 25583 xiaoju20   0 75.5g  55g  28m S 1729.3 88.1   2172:52 java

 55g more than 50g I apply

 --
 *-Barak*


 --
*-Barak*


spark no output

2015-08-02 Thread Pa Rö
hi community,

i have run my k-means spark application on 1 million data points. the
program works, but no output is generated in the hdfs. when it runs on
10,000 points, an output is written.

maybe someone has an idea?

best regards,
paul


Re: spark no output

2015-08-02 Thread Ted Yu
Can you provide some more detail:

release of Spark you're using
were you running in standalone or YARN cluster mode
have you checked driver log ?

Cheers

On Sun, Aug 2, 2015 at 7:04 AM, Pa Rö paul.roewer1...@googlemail.com
wrote:

 hi community,

 i have run my k-means spark application on 1 million data points. the
 program works, but no output is generated in the hdfs. when it runs on
 10,000 points, an output is written.

 maybe someone has an idea?

 best regards,
 paul



Re: spark no output

2015-08-02 Thread Connor Zanin
I agree with Ted. Could you please post the log file?
On Aug 2, 2015 10:13 AM, Ted Yu yuzhih...@gmail.com wrote:

 Can you provide some more detail:

 release of Spark you're using
 were you running in standalone or YARN cluster mode
 have you checked driver log ?

 Cheers

 On Sun, Aug 2, 2015 at 7:04 AM, Pa Rö paul.roewer1...@googlemail.com
 wrote:

 hi community,

 i have run my k-means spark application on 1 million data points. the
 program works, but no output is generated in the hdfs. when it runs on
 10,000 points, an output is written.

 maybe someone has an idea?

 best regards,
 paul





Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Jörn Franke
I think your use case can already be implemented with HDFS encryption and/or
SealedObject, if you are looking for something like Altibase.

If you create a JIRA you may want to set the bar a little bit higher and
propose something like MIT CryptDB: https://css.csail.mit.edu/cryptdb/
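
A hedged sketch of the SealedObject route mentioned above, i.e. the "encrypt first,
then put it in an RDD" approach from the original question. This is plain JDK crypto
rather than any Spark feature; the records are made up, key handling is deliberately
naive, and sc is the usual shell SparkContext:

    import javax.crypto.{Cipher, KeyGenerator, SealedObject}

    val keyGen = KeyGenerator.getInstance("AES")
    keyGen.init(128)
    val key = keyGen.generateKey()

    val records = Seq("alice,42", "bob,17")                 // illustrative data
    val sealedRecords = records.map { r =>
      val cipher = Cipher.getInstance("AES")
      cipher.init(Cipher.ENCRYPT_MODE, key)
      new SealedObject(r, cipher)                           // Serializable, so it can live in an RDD
    }
    val rdd = sc.parallelize(sealedRecords)

    // decrypting inside a transformation; the key must be available on the executors
    val decrypted = rdd.map { so =>
      val cipher = Cipher.getInstance("AES")
      cipher.init(Cipher.DECRYPT_MODE, key)
      so.getObject(cipher).asInstanceOf[String]
    }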

Le ven. 31 juil. 2015 à 10:17, Matthew O'Reilly moreill...@qub.ac.uk a
écrit :

 Hi,

 I am currently working on the latest version of Apache Spark (1.4.1),
 pre-built package for Hadoop 2.6+.

 Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory/cache
 (something similar is Altibase's HDB:
 http://altibase.com/in-memory-database-computing-solutions/security/)
 when running applications in Spark? Or is there an external
 library/framework which could be used to encrypt RDDs or in-memory/cache in
 Spark?

 I discovered it is possible to encrypt the data, and encapsulate it into
 RDD. However, I feel this affects Spark's fast data processing as it is
 slower to encrypt the data, and then encapsulate it to RDD; it's then a two
 step process. Encryption and storing data should be done parallel.

 Any help would be appreciated.

 Many thanks,
 Matthew


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
spark.storage.memoryFraction is in heap memory, but my situation is that the 
memory is more than heap memory !  


Anyone else use spark 1.4.1 in production? 




------ Original Message ------
From: Ted Yu;yuzhih...@gmail.com;
Sent: Aug 2, 2015 (Sunday) 5:45 PM
To: Sea261810...@qq.com;
Cc: Barak Gitsisbar...@similarweb.com;
user@spark.apache.orguser@spark.apache.org; rxinr...@databricks.com;
joshrosenjoshro...@databricks.com; daviesdav...@databricks.com;
Subject: Re: About memory leak in spark 1.4.1



http://spark.apache.org/docs/latest/tuning.html does mention 
spark.storage.memoryFraction in two places.
One is under Cache Size Tuning section.


FYI


On Sun, Aug 2, 2015 at 2:16 AM, Sea 261810...@qq.com wrote:
Hi, Barak
It is ok with spark 1.3.0, the problem is with spark 1.4.1.
I don't think spark.storage.memoryFraction will make any sense, because it 
is still in heap memory. 




------ Original Message ------
From: Barak Gitsis;bar...@similarweb.com;
Sent: Aug 2, 2015 (Sunday) 4:11 PM
To: Sea261810...@qq.com; useruser@spark.apache.org;
Cc: rxinr...@databricks.com; joshrosenjoshro...@databricks.com;
daviesdav...@databricks.com;
Subject: Re: About memory leak in spark 1.4.1



Hi,reducing spark.storage.memoryFraction did the trick for me. Heap doesn't get 
filled because it is reserved..
My reasoning is: 
I give executor all the memory i can give it, so that makes it a boundary.
From here i try to make the best use of memory I can. storage.memoryFraction is 
in a sense user data space.  The rest can be used by the system. 
If you don't have so much data that you MUST store in memory for performance, 
better give spark more space.. 
ended up setting it to 0.3


All that said, it is on spark 1.3 on cluster


hope that helps


On Sat, Aug 1, 2015 at 5:43 PM Sea 261810...@qq.com wrote:

Hi, all
I upgraded spark to 1.4.1, and many applications failed... I find the heap memory is
not full, but the process of CoarseGrainedExecutorBackend will take more
memory than I expect, and it will increase as time goes on, finally exceeding
the max limit of the server, and the worker will die.


Can anyone help?


Mode: standalone


spark.executor.memory 50g


25583 xiaoju20   0 75.5g  55g  28m S 1729.3 88.1   2172:52 java


55g more than 50g I apply



-- 

-Barak

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Igor Berman
What kind of cluster? How many cores on each worker? Is there config for
http solr client? I remember standard httpclient has limit per route/host.
On Aug 2, 2015 8:17 PM, Sujit Pal sujitatgt...@gmail.com wrote:

 No one has any ideas?

 Is there some more information I should provide?

 I am looking for ways to increase the parallelism among workers. Currently
 I just see number of simultaneous connections to Solr equal to the number
 of workers. My number of partitions is (2.5x) larger than number of
 workers, and the workers seem to be large enough to handle more than one
 task at a time.

 I am creating a single client per partition in my mapPartition call. Not
 sure if that is creating the gating situation? Perhaps I should use a Pool
 of clients instead?

 Would really appreciate some pointers.

 Thanks in advance for any help you can provide.

 -sujit


 On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal sujitatgt...@gmail.com wrote:

 Hello,

 I am trying to run a Spark job that hits an external webservice to get
 back some information. The cluster is 1 master + 4 workers, each worker has
 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server,
 and is accessed using code similar to that shown below.

 def getResults(keyValues: Iterator[(String, Array[String])]):
 Iterator[(String, String)] = {
 val solr = new HttpSolrClient()
 initializeSolrParameters(solr)
 keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
 }
 myRDD.repartition(10)

  .mapPartitions(keyValues => getResults(keyValues))


 The mapPartitions does some initialization to the SolrJ client per
 partition and then hits it for each record in the partition via the
 getResults() call.

 I repartitioned in the hope that this will result in 10 clients hitting
 Solr simultaneously (I would like to go upto maybe 30-40 simultaneous
 clients if I can). However, I counted the number of open connections using
 netstat -anp | grep :8983.*ESTABLISHED in a loop on the Solr box and
 observed that Solr has a constant 4 clients (ie, equal to the number of
 workers) over the lifetime of the run.

 My observation leads me to believe that each worker processes a single
 stream of work sequentially. However, from what I understand about how
 Spark works, each worker should be able to process number of tasks
 parallelly, and that repartition() is a hint for it to do so.

 Is there some SparkConf environment variable I should set to increase
 parallelism in these workers, or should I just configure a cluster with
 multiple workers per machine? Or is there something I am doing wrong?

 Thank you in advance for any pointers you can provide.

 -sujit





Re: TCP/IP speedup

2015-08-02 Thread Michael Segel
This may seem like a silly question… but in following Mark’s link, the 
presentation talks about the TPC-DS benchmark. 

Here’s my question… what benchmark results? 

If you go over to the TPC.org http://tpc.org/ website they have no TPC-DS 
benchmarks listed. 
(Either audited or unaudited) 

So what gives? 

Note: There are TPCx-HS benchmarks listed… 

Thx

-Mike

 On Aug 1, 2015, at 5:45 PM, Mark Hamstra m...@clearstorydata.com wrote:
 
 https://spark-summit.org/2015/events/making-sense-of-spark-performance/
 
 On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus edel...@gmail.com wrote:
 Hi All!
 
 How important would be a significant performance improvement to TCP/IP 
 itself, in terms of 
 overall job performance improvement. Which part would be most significantly 
 accelerated? 
 Would it be HDFS?
 
 -- ttfn
 Simon Edelhaus
 California 2015
 




how to ignore MatchError then processing a large json file in spark-sql

2015-08-02 Thread fuellee lee
I'm trying to process a bunch of large json log files with spark, but it
fails every time with `scala.MatchError`, Whether I give it schema or not.

I just want to skip lines that does not match schema, but I can't find how
in docs of spark.

I know write a json parser and map it to json file RDD can get things done,
but I want to use
`sqlContext.read.schema(schema).json(fileNames).selectExpr(...)` because
it's much easier to maintain.

thanks
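
One hedged workaround sketch (not a documented Spark option): pre-filter lines that
fail to parse before the DataFrame reader sees them. fileNames and schema are the same
hypothetical values as in the question, and sc / sqlContext are the shell's. Note that
this only drops syntactically broken lines; records that parse but disagree with the
declared schema (a common MatchError trigger) may still need per-field checks.

    import scala.util.Try
    import org.json4s.jackson.JsonMethods.parse

    val raw        = sc.textFile(fileNames.mkString(","))
    val wellFormed = raw.filter(line => Try(parse(line)).isSuccess)   // keep only valid JSON lines
    val df         = sqlContext.read.schema(schema).json(wellFormed)
    // df.selectExpr(...)  -- continue as in the question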


Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Sujit Pal
No one has any ideas?

Is there some more information I should provide?

I am looking for ways to increase the parallelism among workers. Currently
I just see number of simultaneous connections to Solr equal to the number
of workers. My number of partitions is (2.5x) larger than number of
workers, and the workers seem to be large enough to handle more than one
task at a time.

I am creating a single client per partition in my mapPartition call. Not
sure if that is creating the gating situation? Perhaps I should use a Pool
of clients instead?

Would really appreciate some pointers.

Thanks in advance for any help you can provide.

-sujit


On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal sujitatgt...@gmail.com wrote:

 Hello,

 I am trying to run a Spark job that hits an external webservice to get
 back some information. The cluster is 1 master + 4 workers, each worker has
 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server,
 and is accessed using code similar to that shown below.

 def getResults(keyValues: Iterator[(String, Array[String])]):
 Iterator[(String, String)] = {
 val solr = new HttpSolrClient()
 initializeSolrParameters(solr)
 keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
 }
 myRDD.repartition(10)

  .mapPartitions(keyValues => getResults(keyValues))


 The mapPartitions does some initialization to the SolrJ client per
 partition and then hits it for each record in the partition via the
 getResults() call.

 I repartitioned in the hope that this will result in 10 clients hitting
 Solr simultaneously (I would like to go upto maybe 30-40 simultaneous
 clients if I can). However, I counted the number of open connections using
 netstat -anp | grep :8983.*ESTABLISHED in a loop on the Solr box and
 observed that Solr has a constant 4 clients (ie, equal to the number of
 workers) over the lifetime of the run.

 My observation leads me to believe that each worker processes a single
 stream of work sequentially. However, from what I understand about how
 Spark works, each worker should be able to process number of tasks
 parallelly, and that repartition() is a hint for it to do so.

 Is there some SparkConf environment variable I should set to increase
 parallelism in these workers, or should I just configure a cluster with
 multiple workers per machine? Or is there something I am doing wrong?

 Thank you in advance for any pointers you can provide.

 -sujit




Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Igor Berman
so how many cores do you configure per node?
do you have something like --total-executor-cores or maybe a
--num-executors config? (I'm
not sure what kind of cluster the Databricks platform provides; if it's
standalone then the first option should be used.) if you have 4 cores in total,
then even though you have 4 cores per machine only 1 is working on each
machine... which could be a cause.
another option - you are hitting some default config limiting the number of
concurrent routes or the max total connections from the JVM;
look at
https://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html
 (assuming you are using HttpClient from 4.x and not 3.x version)
not sure what the defaults are...
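
For what it's worth, a hedged sketch of that HttpClient 4.x tuning, handing the client
to the SolrJ server wrapper (class names are from SolrJ 4.x / HttpClient 4.3+; the URL
and limits are illustrative, and whether this lifts the 4-connection ceiling observed
here is untested):

    import org.apache.http.impl.client.HttpClients
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager
    import org.apache.solr.client.solrj.impl.HttpSolrServer

    // PoolingHttpClientConnectionManager defaults to 2 connections per route / 20 total,
    // which can serialize concurrent requests to a single Solr host.
    val cm = new PoolingHttpClientConnectionManager()
    cm.setMaxTotal(40)
    cm.setDefaultMaxPerRoute(40)
    val httpClient = HttpClients.custom().setConnectionManager(cm).build()

    val solr = new HttpSolrServer("http://solr-host:8983/solr/collection1", httpClient)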



On 2 August 2015 at 23:42, Sujit Pal sujitatgt...@gmail.com wrote:

 Hi Igor,

 The cluster is a Databricks Spark cluster. It consists of 1 master + 4
 workers, each worker has 60GB RAM and 4 CPUs. The original mail has some
 more details (also the reference to the HttpSolrClient in there should be
 HttpSolrServer, sorry about that, mistake while writing the email).

 There is no additional configuration on the external Solr host from my
 code, I am using the default HttpClient provided by HttpSolrServer.
 According to the Javadocs, you can pass in a HttpClient object as well. Is
 there some specific configuration you would suggest to get past any limits?

 On another project, I faced a similar problem but I had more leeway (was
 using a Spark cluster from EC2) and less time, my workaround was to use
 python multiprocessing to create a program that started up 30 python
 JSON/HTTP clients and wrote output into 30 output files, which were then
 processed by Spark. Reason I mention this is that I was using default
 configurations there as well, just needed to increase the number of
 connections against Solr to a higher number.

 This time round, I would like to do this through Spark because it makes
 the pipeline less complex.

 -sujit


 On Sun, Aug 2, 2015 at 10:52 AM, Igor Berman igor.ber...@gmail.com
 wrote:

 What kind of cluster? How many cores on each worker? Is there config for
 http solr client? I remember standard httpclient has limit per route/host.
 On Aug 2, 2015 8:17 PM, Sujit Pal sujitatgt...@gmail.com wrote:

 No one has any ideas?

 Is there some more information I should provide?

 I am looking for ways to increase the parallelism among workers.
 Currently I just see number of simultaneous connections to Solr equal to
 the number of workers. My number of partitions is (2.5x) larger than number
 of workers, and the workers seem to be large enough to handle more than one
 task at a time.

 I am creating a single client per partition in my mapPartition call. Not
 sure if that is creating the gating situation? Perhaps I should use a Pool
 of clients instead?

 Would really appreciate some pointers.

 Thanks in advance for any help you can provide.

 -sujit


 On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal sujitatgt...@gmail.com
 wrote:

 Hello,

 I am trying to run a Spark job that hits an external webservice to get
 back some information. The cluster is 1 master + 4 workers, each worker has
 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server,
 and is accessed using code similar to that shown below.

 def getResults(keyValues: Iterator[(String, Array[String])]):
 Iterator[(String, String)] = {
 val solr = new HttpSolrClient()
 initializeSolrParameters(solr)
 keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
 }
 myRDD.repartition(10)

  .mapPartitions(keyValues => getResults(keyValues))


 The mapPartitions does some initialization to the SolrJ client per
 partition and then hits it for each record in the partition via the
 getResults() call.

 I repartitioned in the hope that this will result in 10 clients hitting
 Solr simultaneously (I would like to go upto maybe 30-40 simultaneous
 clients if I can). However, I counted the number of open connections using
 netstat -anp | grep :8983.*ESTABLISHED in a loop on the Solr box and
 observed that Solr has a constant 4 clients (ie, equal to the number of
 workers) over the lifetime of the run.

 My observation leads me to believe that each worker processes a single
 stream of work sequentially. However, from what I understand about how
 Spark works, each worker should be able to process number of tasks
 parallelly, and that repartition() is a hint for it to do so.

 Is there some SparkConf environment variable I should set to increase
 parallelism in these workers, or should I just configure a cluster with
 multiple workers per machine? Or is there something I am doing wrong?

 Thank you in advance for any pointers you can provide.

 -sujit






Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Sujit Pal
Hi Igor,

The cluster is a Databricks Spark cluster. It consists of 1 master + 4
workers, each worker has 60GB RAM and 4 CPUs. The original mail has some
more details (also the reference to the HttpSolrClient in there should be
HttpSolrServer, sorry about that, mistake while writing the email).

There is no additional configuration on the external Solr host from my
code, I am using the default HttpClient provided by HttpSolrServer.
According to the Javadocs, you can pass in a HttpClient object as well. Is
there some specific configuration you would suggest to get past any limits?

On another project, I faced a similar problem but I had more leeway (was
using a Spark cluster from EC2) and less time, my workaround was to use
python multiprocessing to create a program that started up 30 python
JSON/HTTP clients and wrote output into 30 output files, which were then
processed by Spark. Reason I mention this is that I was using default
configurations there as well, just needed to increase the number of
connections against Solr to a higher number.

This time round, I would like to do this through Spark because it makes the
pipeline less complex.

-sujit


On Sun, Aug 2, 2015 at 10:52 AM, Igor Berman igor.ber...@gmail.com wrote:

 What kind of cluster? How many cores on each worker? Is there config for
 http solr client? I remember standard httpclient has limit per route/host.
 On Aug 2, 2015 8:17 PM, Sujit Pal sujitatgt...@gmail.com wrote:

 No one has any ideas?

 Is there some more information I should provide?

 I am looking for ways to increase the parallelism among workers.
 Currently I just see number of simultaneous connections to Solr equal to
 the number of workers. My number of partitions is (2.5x) larger than number
 of workers, and the workers seem to be large enough to handle more than one
 task at a time.

 I am creating a single client per partition in my mapPartition call. Not
 sure if that is creating the gating situation? Perhaps I should use a Pool
 of clients instead?

 Would really appreciate some pointers.

 Thanks in advance for any help you can provide.

 -sujit


 On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal sujitatgt...@gmail.com
 wrote:

 Hello,

 I am trying to run a Spark job that hits an external webservice to get
 back some information. The cluster is 1 master + 4 workers, each worker has
 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server,
 and is accessed using code similar to that shown below.

 def getResults(keyValues: Iterator[(String, Array[String])]):
 Iterator[(String, String)] = {
 val solr = new HttpSolrClient()
 initializeSolrParameters(solr)
 keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
 }
 myRDD.repartition(10)

  .mapPartitions(keyValues => getResults(keyValues))


 The mapPartitions does some initialization to the SolrJ client per
 partition and then hits it for each record in the partition via the
 getResults() call.

 I repartitioned in the hope that this will result in 10 clients hitting
 Solr simultaneously (I would like to go upto maybe 30-40 simultaneous
 clients if I can). However, I counted the number of open connections using
 netstat -anp | grep :8983.*ESTABLISHED in a loop on the Solr box and
 observed that Solr has a constant 4 clients (ie, equal to the number of
 workers) over the lifetime of the run.

 My observation leads me to believe that each worker processes a single
 stream of work sequentially. However, from what I understand about how
 Spark works, each worker should be able to process number of tasks
 parallelly, and that repartition() is a hint for it to do so.

 Is there some SparkConf environment variable I should set to increase
 parallelism in these workers, or should I just configure a cluster with
 multiple workers per machine? Or is there something I am doing wrong?

 Thank you in advance for any pointers you can provide.

 -sujit





RE: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Silvio Fiorito
Can you share the transformations up to the foreachPartition?

From: Sujit Pal sujitatgt...@gmail.com
Sent: 8/2/2015 4:42 PM
To: Igor Berman igor.ber...@gmail.com
Cc: user user@spark.apache.org
Subject: Re: How to increase parallelism of a Spark cluster?

Hi Igor,

The cluster is a Databricks Spark cluster. It consists of 1 master + 4 workers, 
each worker has 60GB RAM and 4 CPUs. The original mail has some more details 
(also the reference to the HttpSolrClient in there should be HttpSolrServer, 
sorry about that, mistake while writing the email).

There is no additional configuration on the external Solr host from my code, I 
am using the default HttpClient provided by HttpSolrServer. According to the 
Javadocs, you can pass in a HttpClient object as well. Is there some specific 
configuration you would suggest to get past any limits?

On another project, I faced a similar problem but I had more leeway (was using 
a Spark cluster from EC2) and less time, my workaround was to use python 
multiprocessing to create a program that started up 30 python JSON/HTTP clients 
and wrote output into 30 output files, which were then processed by Spark. 
Reason I mention this is that I was using default configurations there as well, 
just needed to increase the number of connections against Solr to a higher 
number.

This time round, I would like to do this through Spark because it makes the 
pipeline less complex.

-sujit


On Sun, Aug 2, 2015 at 10:52 AM, Igor Berman igor.ber...@gmail.com wrote:

What kind of cluster? How many cores on each worker? Is there config for http 
solr client? I remember standard httpclient has limit per route/host.

On Aug 2, 2015 8:17 PM, Sujit Pal sujitatgt...@gmail.com wrote:
No one has any ideas?

Is there some more information I should provide?

I am looking for ways to increase the parallelism among workers. Currently I 
just see number of simultaneous connections to Solr equal to the number of 
workers. My number of partitions is (2.5x) larger than number of workers, and 
the workers seem to be large enough to handle more than one task at a time.

I am creating a single client per partition in my mapPartition call. Not sure 
if that is creating the gating situation? Perhaps I should use a Pool of 
clients instead?

Would really appreciate some pointers.

Thanks in advance for any help you can provide.

-sujit


On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal sujitatgt...@gmail.com wrote:
Hello,

I am trying to run a Spark job that hits an external webservice to get back 
some information. The cluster is 1 master + 4 workers, each worker has 60GB RAM 
and 4 CPUs. The external webservice is a standalone Solr server, and is 
accessed using code similar to that shown below.

def getResults(keyValues: Iterator[(String, Array[String])]):
Iterator[(String, String)] = {
val solr = new HttpSolrClient()
initializeSolrParameters(solr)
keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
}
myRDD.repartition(10)
 .mapPartitions(keyValues => getResults(keyValues))

The mapPartitions does some initialization to the SolrJ client per partition 
and then hits it for each record in the partition via the getResults() call.

I repartitioned in the hope that this will result in 10 clients hitting Solr 
simultaneously (I would like to go upto maybe 30-40 simultaneous clients if I 
can). However, I counted the number of open connections using netstat -anp | 
grep :8983.*ESTABLISHED in a loop on the Solr box and observed that Solr has 
a constant 4 clients (ie, equal to the number of workers) over the lifetime of 
the run.

My observation leads me to believe that each worker processes a single stream 
of work sequentially. However, from what I understand about how Spark works, 
each worker should be able to process number of tasks parallelly, and that 
repartition() is a hint for it to do so.

Is there some SparkConf environment variable I should set to increase 
parallelism in these workers, or should I just configure a cluster with 
multiple workers per machine? Or is there something I am doing wrong?

Thank you in advance for any pointers you can provide.

-sujit





Extremely poor predictive performance with RF in mllib

2015-08-02 Thread pkphlam
Hi,

This might be a long shot, but has anybody run into very poor predictive
performance using RandomForest with Mllib? Here is what I'm doing:

- Spark 1.4.1 with PySpark
- Python 3.4.2
- ~30,000 Tweets of text
- 12289 1s and 15956 0s
- Whitespace tokenization and then hashing trick for feature selection using
10,000 features
- Run RF with 100 trees and maxDepth of 4 and then predict using the
features from all the 1s observations.

So in theory, I should get predictions of close to 12289 1s (especially if
the model overfits). But I'm getting exactly 0 1s, which sounds ludicrous to
me and makes me suspect something is wrong with my code or I'm missing
something. I notice similar behavior (although not as extreme) if I play
around with the settings. But I'm getting normal behavior with other
classifiers, so I don't think it's my setup that's the problem.

For example:

 lrm = LogisticRegressionWithSGD.train(lp, iterations=10)
 logit_predict = lrm.predict(predict_feat)
 logit_predict.sum()
9077

 nb = NaiveBayes.train(lp)
 nb_predict = nb.predict(predict_feat)
 nb_predict.sum()
10287.0

 rf = RandomForest.trainClassifier(lp, numClasses=2,
 categoricalFeaturesInfo={}, numTrees=100, seed=422)
 rf_predict = rf.predict(predict_feat)
 rf_predict.sum()
0.0

This code was all run back to back so I didn't change anything in between.
Does anybody have a possible explanation for this?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Extremely-poor-predictive-performance-with-RF-in-mllib-tp24112.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Abhishek R. Singh
I don't know if (your assertion/expectation that) workers will process things 
(multiple partitions) in parallel is really valid. Or if having more partitions 
than workers will necessarily help (unless you are memory bound - so partitions 
is essentially helping your work size rather than execution parallelism).

[Disclaimer: I am no authority on Spark, but wanted to throw my spin based my 
own understanding]. 

Nothing official about it :)

-abhishek-

 On Jul 31, 2015, at 1:03 PM, Sujit Pal sujitatgt...@gmail.com wrote:
 
 Hello,
 
 I am trying to run a Spark job that hits an external webservice to get back 
 some information. The cluster is 1 master + 4 workers, each worker has 60GB 
 RAM and 4 CPUs. The external webservice is a standalone Solr server, and is 
 accessed using code similar to that shown below.
 
 def getResults(keyValues: Iterator[(String, Array[String])]):
 Iterator[(String, String)] = {
 val solr = new HttpSolrClient()
 initializeSolrParameters(solr)
 keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
 }
 myRDD.repartition(10) 
  .mapPartitions(keyValues => getResults(keyValues))
  
 The mapPartitions does some initialization to the SolrJ client per partition 
 and then hits it for each record in the partition via the getResults() call.
 
 I repartitioned in the hope that this will result in 10 clients hitting Solr 
 simultaneously (I would like to go upto maybe 30-40 simultaneous clients if I 
 can). However, I counted the number of open connections using netstat -anp | 
 grep :8983.*ESTABLISHED in a loop on the Solr box and observed that Solr 
 has a constant 4 clients (ie, equal to the number of workers) over the 
 lifetime of the run.
 
 My observation leads me to believe that each worker processes a single stream 
 of work sequentially. However, from what I understand about how Spark works, 
 each worker should be able to process number of tasks parallelly, and that 
 repartition() is a hint for it to do so.
 
 Is there some SparkConf environment variable I should set to increase 
 parallelism in these workers, or should I just configure a cluster with 
 multiple workers per machine? Or is there something I am doing wrong?
 
 Thank you in advance for any pointers you can provide.
 
 -sujit
 


Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
"spark uses a lot more than heap memory, it is the expected behavior."  It
didn't exist in spark 1.3.x.
What does "a lot more than" mean?  It means that I lose control of it!
I tried to request 31g, but it still grows to 55g and continues to grow!!! That is
the point!
I have tried setting memoryFraction to 0.2, but it didn't help.
I don't know whether it will still exist in the next release, 1.5; I hope not.






------ Original Message ------
From: Barak Gitsis;bar...@similarweb.com;
Sent: Aug 2, 2015 (Sunday) 9:55 PM
To: Sea261810...@qq.com; Ted Yuyuzhih...@gmail.com;
Cc: user@spark.apache.orguser@spark.apache.org;
rxinr...@databricks.com; joshrosenjoshro...@databricks.com;
daviesdav...@databricks.com;
Subject: Re: About memory leak in spark 1.4.1



spark uses a lot more than heap memory, it is the expected behavior.in 1.4 
off-heap memory usage is supposed to grow in comparison to 1.3


Better use as little memory as you can for heap, and since you are not 
utilizing it already, it is safe for you to reduce it.
memoryFraction helps you optimize heap usage for your data/application profile 
while keeping it tight.



 






On Sun, Aug 2, 2015 at 12:54 PM Sea 261810...@qq.com wrote:

spark.storage.memoryFraction is in heap memory, but my situation is that the 
memory is more than heap memory !  


Anyone else use spark 1.4.1 in production? 




------ Original Message ------
From: Ted Yu;yuzhih...@gmail.com;
Sent: Aug 2, 2015 (Sunday) 5:45 PM
To: Sea261810...@qq.com;
Cc: Barak Gitsisbar...@similarweb.com;
user@spark.apache.orguser@spark.apache.org; rxinr...@databricks.com;
joshrosenjoshro...@databricks.com; daviesdav...@databricks.com;
Subject: Re: About memory leak in spark 1.4.1




http://spark.apache.org/docs/latest/tuning.html does mention 
spark.storage.memoryFraction in two places.
One is under Cache Size Tuning section.


FYI


On Sun, Aug 2, 2015 at 2:16 AM, Sea 261810...@qq.com wrote:
Hi, Barak
It is ok with spark 1.3.0, the problem is with spark 1.4.1.
I don't think spark.storage.memoryFraction will make any sense, because it 
is still in heap memory. 




------ Original Message ------
From: Barak Gitsis;bar...@similarweb.com;
Sent: Aug 2, 2015 (Sunday) 4:11 PM
To: Sea261810...@qq.com; useruser@spark.apache.org;
Cc: rxinr...@databricks.com; joshrosenjoshro...@databricks.com;
daviesdav...@databricks.com;
Subject: Re: About memory leak in spark 1.4.1



Hi,reducing spark.storage.memoryFraction did the trick for me. Heap doesn't get 
filled because it is reserved..
My reasoning is: 
I give executor all the memory i can give it, so that makes it a boundary.
From here i try to make the best use of memory I can. storage.memoryFraction is 
in a sense user data space.  The rest can be used by the system. 
If you don't have so much data that you MUST store in memory for performance, 
better give spark more space.. 
ended up setting it to 0.3


All that said, it is on spark 1.3 on cluster


hope that helps


On Sat, Aug 1, 2015 at 5:43 PM Sea 261810...@qq.com wrote:

Hi, all
I upgraded spark to 1.4.1, and many applications failed... I find the heap memory is
not full, but the process of CoarseGrainedExecutorBackend will take more
memory than I expect, and it will increase as time goes on, finally exceeding
the max limit of the server, and the worker will die.


Can anyone help?


Mode: standalone


spark.executor.memory 50g


25583 xiaoju20   0 75.5g  55g  28m S 1729.3 88.1   2172:52 java


55g more than 50g I apply



-- 

-Barak








-- 

-Barak

Re: spark cluster setup

2015-08-02 Thread Sonal Goyal
What do the master logs show?

Best Regards,
Sonal
Founder, Nube Technologies
http://www.nubetech.co/

Check out Reifier at Spark Summit 2015
https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/

http://in.linkedin.com/in/sonalgoyal



On Mon, Aug 3, 2015 at 7:46 AM, Angel Angel areyouange...@gmail.com wrote:

 Hello Sir,

 I have install the spark.



 The local  spark-shell is working fine.



 But whenever I tried the Master configuration I got some errors.



 When I run this command ;

 MASTER=spark://hadoopm0:7077 spark-shell



 I gets the errors likes;



 15/07/27 21:17:26 INFO AppClient$ClientActor: Connecting to master
 spark://hadoopm0:7077...

 15/07/27 21:17:46 ERROR SparkDeploySchedulerBackend: Application has been
 killed. Reason: All masters are unresponsive! Giving up.

 15/07/27 21:17:46 WARN SparkDeploySchedulerBackend: Application ID is not
 initialized yet.

 15/07/27 21:17:46 ERROR TaskSchedulerImpl: Exiting due to error from
 cluster scheduler: All masters are unresponsive! Giving up.



 Also I have attached the my screenshot of Master UI.


 Also i have tested using telnet command:


 it shows that hadoopm0 is connected



 Can you please give me some references, documentations or  how to solve
 this issue.

 Thanks in advance.

 Thanking You,


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



Re: TCP/IP speedup

2015-08-02 Thread Steve Loughran

On 1 Aug 2015, at 18:26, Ruslan Dautkhanov dautkha...@gmail.com wrote:

If your network is bandwidth-bound, you'll see setting jumbo frames (MTU 9000)
may increase bandwidth up to ~20%.

http://docs.hortonworks.com/HDP2Alpha/index.htm#Hardware_Recommendations_for_Hadoop.htm
Enabling Jumbo Frames across the cluster improves bandwidth

+1

you can also get better checksums of packets, so that the (very small but 
non-zero) risk of corrupted network packets drops a bit more.


If Spark workload is not network bandwidth-bound, I can see it'll be a few 
percent to no improvement.



Put differently: it shouldn't hurt. The shuffle phase is the most network
heavy; especially as it can span the entire cluster, the backbone's bisection
bandwidth can become the bottleneck, and that means jobs can
interfere with each other.

scheduling of work close to the HDFS data means that HDFS reads should often be 
local (the TCP stack gets bypassed entirely), or at least rack-local (sharing 
the switch, not any backbone)


but there's other things there, as the slide talks about


-stragglers: often a sign of pending HDD failure, as reads are retried. the 
classic hadoop MR engine detects these, can spin up alternate mappers (if you 
enable speculation), and will blacklist the node for further work. Sometimes 
though that straggling is just unbalanced data -some bits of work may be 
computationally a lot harder, slowing things down.

-contention for work on the nodes. In YARN you request how many virtual cores 
you want (ops get to define the map of virtual to physical), with each node 
having a finite set of cores

but ...
  -Unless CPU throttling is turned on, competing processes can take up more CPU 
than they asked for.
  -that virtual:physical core setting may be of

There's also disk IOP contention; two jobs trying to get at the same spindle, 
even though there are lots of disks on the server. There's not much you can do 
about that (today).

A key takeaway from that talk, which applies to all work-tuning talks is: get 
data from your real workloads, There's some good htrace instrumentation in HDFS 
these days, I haven't looked @ spark's instrumentation to see how they hook up. 
You can also expect to have some network monitoring (sflow, ...) which you 
could use to see if the backbone is overloaded. Don't forget the Linux tooling 
either, iotop c. There's lots of room to play here -once you've got the data 
you can see where to focus, then decide how much time to spend trying to tune 
it.

-steve


--
Ruslan Dautkhanov

On Sat, Aug 1, 2015 at 6:08 PM, Simon Edelhaus edel...@gmail.com wrote:
H

2% huh.


-- ttfn
Simon Edelhaus
California 2015

On Sat, Aug 1, 2015 at 3:45 PM, Mark Hamstra m...@clearstorydata.com wrote:
https://spark-summit.org/2015/events/making-sense-of-spark-performance/

On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus edel...@gmail.com wrote:
Hi All!

How important would be a significant performance improvement to TCP/IP itself, 
in terms of
overall job performance improvement. Which part would be most significantly 
accelerated?
Would it be HDFS?

-- ttfn
Simon Edelhaus
California 2015






Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Steve Loughran

On 2 Aug 2015, at 13:42, Sujit Pal sujitatgt...@gmail.com wrote:

There is no additional configuration on the external Solr host from my code, I 
am using the default HttpClient provided by HttpSolrServer. According to the 
Javadocs, you can pass in a HttpClient object as well. Is there some specific 
configuration you would suggest to get past any limits?


Usually there's some thread pooling going on client side, covered in docs like
http://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html

I don't know if that applies, how to tune it, etc. I do know that if you go the 
other way and allow unlimited connections you raise different support 
problems.

-steve