Re: use case reading files split per id

2016-11-14 Thread Mo Tao
Hi ruben,

You may try sc.binaryFiles, which is designed for large numbers of small files
and maps each file path to an input stream.
Each stream carries only the path and some Hadoop configuration, so it is
cheap to shuffle these records.
However, I'm not sure whether Spark takes data locality into account when it
processes these streams.
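
A minimal sketch (untested; assumes an existing SparkContext sc, and the input
path is a placeholder):

  import org.apache.spark.input.PortableDataStream

  // One record per file: (path, lazy stream). Only the path and Hadoop
  // configuration are shipped; the bytes are read on the executor when
  // the stream is actually consumed.
  val files = sc.binaryFiles("hdfs:///data/small-files/")

  // Materialize each file's bytes and record its size.
  val bytesPerFile = files.map { case (path, stream: PortableDataStream) =>
    (path, stream.toArray().length)
  }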

Hope this helps






Re: Cannot find Native Library in "cluster" deploy-mode

2016-11-14 Thread Mo Tao
Hi jtgenesis,

An UnsatisfiedLinkError can be caused by a missing library that your .so files
themselves depend on, so take a close look at the exception message.

You can also try setExecutorEnv("LD_LIBRARY_PATH", ".:" +
sys.env("LD_LIBRARY_PATH")) and then submit your job with your .so files
passed via the --files option. Spark will place those .so files in the working
directory of each executor.
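
A minimal sketch of that setup (untested; the app name, library name, and
paths are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  // Prepend the executor's working directory (".") so .so files shipped with
  // --files can be found, and keep whatever LD_LIBRARY_PATH is already set.
  val conf = new SparkConf()
    .setAppName("native-lib-example")
    .setExecutorEnv("LD_LIBRARY_PATH", ".:" + sys.env.getOrElse("LD_LIBRARY_PATH", ""))
  val sc = new SparkContext(conf)

  // Then submit with the native library shipped to every executor, e.g.:
  //   spark-submit --deploy-mode cluster --files /path/to/libnative.so ... your-app.jar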

Best






How to change the parallelism level of input dstreams

2014-04-09 Thread Dong Mo
Dear list,

A quick question about Spark Streaming:

Say I have this stage set up in my Spark Streaming cluster:

batched TCP stream ==> map(expensive computation) ==> reduceByKey

I know I can set the number of tasks for reduceByKey.

But I didn't find a place to specify the parallelism for the input dstream
(the RDD sequence generated from the TCP stream). Do I need to explicitly call
repartition() to split the input RDDs into more partitions? If so, what
mechanism is used to split the stream: a random, full repartition of each
(K, V) pair (effectively a shuffle), or something more like a rebalance?
And what is the default parallelism level for the input stream?
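
For concreteness, something like this sketch is what I have in mind (untested;
the host, port, parallelism values, and the helper functions are placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._

  // Placeholder helpers standing in for the real key extraction and work.
  def key(line: String): String = line.takeWhile(_ != ',')
  def expensiveComputation(line: String): Long = line.length.toLong

  val ssc = new StreamingContext(new SparkConf().setAppName("tcp-pipeline"), Seconds(5))
  val lines = ssc.socketTextStream("localhost", 9999)

  // repartition() reshuffles each batch's RDD into 32 partitions, spreading
  // the expensive map over more tasks.
  val mapped = lines.repartition(32).map(line => (key(line), expensiveComputation(line)))

  // reduceByKey accepts an explicit number of reduce tasks.
  val reduced = mapped.reduceByKey(_ + _, 16)

  reduced.print()
  ssc.start()
  ssc.awaitTermination()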

Thank you so much
-Mo


Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-09 Thread Dong Mo
All of these work.

Thanks
-Mo


2014-04-09 2:34 GMT-04:00 Xiangrui Meng :

> After sbt/sbt gen-idea, do not import as an SBT project but choose
> "open project" and point it to the spark folder. -Xiangrui
>
> On Tue, Apr 8, 2014 at 10:45 PM, Sean Owen  wrote:
> > I let IntelliJ read the Maven build directly and that works fine.
> > --
> > Sean Owen | Director, Data Science | London
> >
> >
> > On Wed, Apr 9, 2014 at 6:14 AM, Dong Mo  wrote:
> >> Dear list,
> >>
> >> SBT compiles fine, but when I do the following:
> >> sbt/sbt gen-idea
> >> import project as SBT project to IDEA 13.1
> >> Make Project
> >> and these errors show up:
> >>
> >> Error:(28, 8) object FileContext is not a member of package
> >> org.apache.hadoop.fs
> >> import org.apache.hadoop.fs.{FileContext, FileStatus, FileSystem, Path,
> >> FileUtil}
> >>^
> >> Error:(31, 8) object Master is not a member of package
> >> org.apache.hadoop.mapred
> >> import org.apache.hadoop.mapred.Master
> >>^
> >> Error:(34, 26) object yarn is not a member of package org.apache.hadoop
> >> import org.apache.hadoop.yarn.api._
> >>  ^
> >> Error:(35, 26) object yarn is not a member of package org.apache.hadoop
> >> import org.apache.hadoop.yarn.api.ApplicationConstants.Environment
> >>  ^
> >> Error:(36, 26) object yarn is not a member of package org.apache.hadoop
> >> import org.apache.hadoop.yarn.api.protocolrecords._
> >>  ^
> >> Error:(37, 26) object yarn is not a member of package org.apache.hadoop
> >> import org.apache.hadoop.yarn.api.records._
> >>  ^
> >> Error:(38, 26) object yarn is not a member of package org.apache.hadoop
> >> import org.apache.hadoop.yarn.client.YarnClientImpl
> >>  ^
> >> Error:(39, 26) object yarn is not a member of package org.apache.hadoop
> >> import org.apache.hadoop.yarn.conf.YarnConfiguration
> >>  ^
> >> Error:(40, 26) object yarn is not a member of package org.apache.hadoop
> >> import org.apache.hadoop.yarn.ipc.YarnRPC
> >>  ^
> >> Error:(41, 26) object yarn is not a member of package org.apache.hadoop
> >> import org.apache.hadoop.yarn.util.{Apps, Records}
> >>  ^
> >> Error:(49, 11) not found: type YarnClientImpl
> >>   extends YarnClientImpl with Logging {
> >>   ^
> >> Error:(48, 20) not found: type ClientArguments
> >> class Client(args: ClientArguments, conf: Configuration, sparkConf:
> >> SparkConf)
> >>^
> >> Error:(51, 18) not found: type ClientArguments
> >>   def this(args: ClientArguments, sparkConf: SparkConf) =
> >>  ^
> >> Error:(54, 18) not found: type ClientArguments
> >>   def this(args: ClientArguments) = this(args, new SparkConf())
> >>  ^
> >> Error:(56, 12) not found: type YarnRPC
> >>   var rpc: YarnRPC = YarnRPC.create(conf)
> >>^
> >> Error:(56, 22) not found: value YarnRPC
> >>   var rpc: YarnRPC = YarnRPC.create(conf)
> >>  ^
> >> Error:(57, 17) not found: type YarnConfiguration
> >>   val yarnConf: YarnConfiguration = new YarnConfiguration(conf)
> >> ^
> >> Error:(57, 41) not found: type YarnConfiguration
> >>   val yarnConf: YarnConfiguration = new YarnConfiguration(conf)
> >> ^
> >> Error:(58, 59) value getCredentials is not a member of
> >> org.apache.hadoop.security.UserGroupInformation
> >>   val credentials =
> UserGroupInformation.getCurrentUser().getCredentials()
> >>   ^
> >> Error:(60, 34) not found: type ClientDistributedCacheManager
> >>   private val distCacheMgr = new ClientDistributedCacheManager()
> >>  ^
> >> Error:(72, 5) not found: value init
> >> init(yarnConf)
> >> ^
> >> Error:(73, 5) not found: value start
> >> start()
> >> ^
> >> Error:(76, 24) value getNewApplication is not a member of
> >> org.apache.spark.Logging
> >> val newApp = super.getNewApplication()
> >>^
> >> Error:(13

Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread Dong Mo
: ApplicationSubmissionContext) = {
^
Error:(429, 33) not found: type ApplicationId
  def monitorApplication(appId: ApplicationId): Boolean = {
^
Error:(123, 25) not found: type YarnClusterMetrics
val clusterMetrics: YarnClusterMetrics = super.getYarnClusterMetrics
^
Error:(123, 52) value getYarnClusterMetrics is not a member of
org.apache.spark.Logging
val clusterMetrics: YarnClusterMetrics = super.getYarnClusterMetrics
   ^
Error:(127, 20) not found: type QueueInfo
val queueInfo: QueueInfo = super.getQueueInfo(args.amQueue)
   ^
Error:(127, 38) value getQueueInfo is not a member of
org.apache.spark.Logging
val queueInfo: QueueInfo = super.getQueueInfo(args.amQueue)
 ^
Error:(158, 22) not found: value Records
val appContext =
Records.newRecord(classOf[ApplicationSubmissionContext])
 ^
Error:(219, 14) not found: value FileContext
val fc = FileContext.getFileContext(qualPath.toUri(), conf)
 ^
Error:(230, 29) not found: value Master
val delegTokenRenewer = Master.getMasterPrincipal(conf)
^
Error:(242, 13) value addDelegationTokens is not a member of
org.apache.hadoop.fs.FileSystem
  dstFs.addDelegationTokens(delegTokenRenewer, credentials)
^
Error:(244, 42) not found: type LocalResource
val localResources = HashMap[String, LocalResource]()
 ^
Error:(302, 43) value addCredentials is not a member of
org.apache.hadoop.security.UserGroupInformation
UserGroupInformation.getCurrentUser().addCredentials(credentials)
  ^
Error:(323, 5) not found: value Apps
Apps.setEnvFromInputString(env, System.getenv("SPARK_YARN_USER_ENV"))
^
Error:(330, 36) not found: type ClientArguments
  def userArgsToString(clientArgs: ClientArguments): String = {
   ^
Error:(345, 23) not found: value Records
val amContainer = Records.newRecord(classOf[ContainerLaunchContext])
  ^
Error:(363, 16) not found: value Environment
  new Path(Environment.PWD.$(),
YarnConfiguration.DEFAULT_CONTAINER_TEMP_DIR) + " "
   ^
Error:(392, 21) not found: value Environment
  javaCommand = Environment.JAVA_HOME.$() + "/bin/java"
^
Error:(405, 16) not found: value ApplicationConstants
  " 1> " + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout" +
   ^
Error:(410, 22) not found: value Records
val capability =
Records.newRecord(classOf[Resource]).asInstanceOf[Resource]
 ^
Error:(410, 72) not found: type Resource
val capability =
Records.newRecord(classOf[Resource]).asInstanceOf[Resource]
   ^
Error:(434, 26) value getApplicationReport is not a member of
org.apache.spark.Logging
  val report = super.getApplicationReport(appId)
 ^
Error:(474, 20) not found: type ClientArguments
val args = new ClientArguments(argStrings, sparkConf)
   ^
Error:(481, 31) not found: value YarnConfiguration
for (c <-
conf.getStrings(YarnConfiguration.YARN_APPLICATION_CLASSPATH)) {
  ^
Error:(487, 5) not found: value Apps
Apps.addToEnvironment(env, Environment.CLASSPATH.name,
Environment.PWD.$())
^
Error:(490, 7) not found: value Apps
  Apps.addToEnvironment(env, Environment.CLASSPATH.name,
Environment.PWD.$() +
  ^
Error:(496, 7) not found: value Apps
  Apps.addToEnvironment(env, Environment.CLASSPATH.name,
Environment.PWD.$() +
  ^
Error:(499, 5) not found: value Apps
Apps.addToEnvironment(env, Environment.CLASSPATH.name,
Environment.PWD.$() +
^
Error:(504, 7) not found: value Apps
  Apps.addToEnvironment(env, Environment.CLASSPATH.name,
Environment.PWD.$() +
  ^
Error:(507, 5) not found: value Apps
Apps.addToEnvironment(env, Environment.CLASSPATH.name,
Environment.PWD.$() +
^

Any idea what's causing these? Perhaps I'm not following the best practice for
importing Spark into an IDE.

I would appreciate any suggestions on the best practice for importing Spark
into any IDE.

Thank you

-Mo


Re: how spark dstream handles congestion?

2014-03-31 Thread Dong Mo
Thanks
-Mo


2014-03-31 13:16 GMT-05:00 Evgeny Shishkin :

>
> On 31 Mar 2014, at 21:05, Dong Mo  wrote:
>
> > Dear list,
> >
> > I was wondering how Spark handles congestion when the upstream is
> > generating DStreams faster than the downstream workers can process them?
>
> It will eventually OOM.
>
>


how spark dstream handles congestion?

2014-03-31 Thread Dong Mo
Dear list,

I was wondering how Spark handles congestion when the upstream is generating
DStreams faster than the downstream workers can process them?

Thanks
-Mo


Re: sample data for pagerank?

2014-03-13 Thread Mo
You can find it here:
https://github.com/apache/incubator-spark/tree/master/graphx/data
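
For example, a quick sketch of running PageRank on that data (untested;
assumes an existing SparkContext sc and the followers.txt edge list in that
folder):

  import org.apache.spark.graphx.GraphLoader

  // Load the sample edge list shipped with GraphX and run PageRank until the
  // ranks converge to the given tolerance.
  val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
  val ranks = graph.pageRank(0.0001).vertices  // RDD of (vertexId, rank)

  // Show the five highest-ranked vertices.
  ranks.collect().sortBy(-_._2).take(5).foreach(println)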


On Thu, Mar 13, 2014 at 10:13 AM, Diana Carroll wrote:

> I'd like to play around with the PageRank example included with Spark, but
> I can't find any sample data included to work with.  Am I missing it?
> Anyone got a sample file to share?
>
>  Thanks,
> Diana
>